Dedication


To Joseph W. Gauld, my high school algebra teacher, for sparking in me a delight in the simple things in 
mathematics. 


Foreword 


When I first got a summer job at MIT's Project MAC almost 30 years ago, I was delighted to be able to work 
with the DEC PDP-10 computer, which was more fun to program in assembly language than any other 
computer, bar none, because of its rich yet tractable set of instructions for performing bit tests, bit masking, 
field manipulation, and operations on integers. Though the PDP-10 has not been manufactured for quite some 
years, there remains a thriving cult of enthusiasts who keep old PDP-10 hardware running and who run old 
PDP-10 software—entire operating systems and their applications—by using personal computers to simulate 
the PDP-10 instruction set. They even write new software; there is now at least one Web site whose pages are 
served up by a simulated PDP-10. (Come on, stop laughing—it's no sillier than keeping antique cars running.) 


I also enjoyed, in that summer of 1972, reading a brand-new MIT research memo called HAKMEM, a bizarre
and eclectic potpourri of technical trivia.[1] The subject matter ranged from electrical circuits to number theory,
but what intrigued me most was its small catalog of ingenious little programming tricks. Each such gem would 
typically describe some plausible yet unusual operation on integers or bit strings (such as counting the 1-bits in 
a word) that could easily be programmed using either a longish fixed sequence of machine instructions or a 
loop, and then show how the same thing might be done much more cleverly, using just four or three or two 
carefully chosen instructions whose interactions are not at all obvious until explained or fathomed. For me, 
devouring these little programming nuggets was like eating peanuts, or rather bonbons—I just couldn't stop— 
and there was a certain richness to them, a certain intellectual depth, elegance, even poetry. 


[1] Why "HAKMEM"? Short for "hacks memo"; one 36-bit PDP-10 word could hold six 6-bit characters, so a lot of the 
names PDP-10 hackers worked with were limited to six characters. We were used to glancing at a six-character 
abbreviated name and instantly decoding the contractions. So naming the memo "HAKMEM" made sense at the time 
—at least to the hackers. 


"Surely," I thought, "there must be more of these," and indeed over the years I collected, and in some cases 
discovered, a few more. "There ought to be a book of them." 


I was genuinely thrilled when I saw Hank Warren's manuscript. He has systematically collected these little 
programming tricks, organized them thematically, and explained them clearly. While some of them may be 
described in terms of machine instructions, this is not a book only for assembly language programmers. The 
subject matter is basic structural relationships among integers and bit strings in a computer and efficient 
techniques for performing useful operations on them. 


These techniques are just as useful in the C or Java programming languages as they are in assembly language. 


Many books on algorithms and data structures teach complicated techniques for sorting and searching, for 
maintaining hash tables and binary trees, for dealing with records and pointers. They overlook what can be 
done with very tiny pieces of data—bits and arrays of bits. It is amazing what can be done with just binary 
addition and subtraction and maybe some bitwise operations; the fact that the carry chain allows a single bit to 
affect all the bits to its left makes addition a peculiarly powerful data manipulation operation in ways that are
not widely appreciated.


Yes, there ought to be a book about these techniques. Now it is in your hands, and it's terrific. If you write 
optimizing compilers or high-performance code, you must read this book. You otherwise might not use this bag 
of tricks every single day—but if you find yourself stuck in some situation where you apparently need to loop 
over the bits in a word, or to perform some operation on integers and it just seems harder to code than it ought, 
or you really need the inner loop of some integer or bit-fiddly computation to run twice as fast, then this is the 
place to look. Or maybe you'll just find yourself reading it straight through out of sheer pleasure. 


Guy L. Steele, Jr. 
Burlington, Massachusetts 
April 2002 


Preface 


Caveat Emptor: The cost of software maintenance increases with the square of the programmer's creativity. 
—First Law of Programmer Creativity, Robert D. Bliss, 1992 


This is a collection of small programming tricks that I have come across over many years. Most of them will 
work only on computers that represent integers in two's-complement form. Although a 32-bit machine is 
assumed when the register length is relevant, most of the tricks are easily adapted to machines with other 
register sizes. 


This book does not deal with large tricks such as sophisticated sorting and compiler optimization techniques. 
Rather, it deals with small tricks that usually involve individual computer words or instructions, such as 
counting the number of 1-bits in a word. Such tricks often use a mixture of arithmetic and logical instructions. 


It is assumed throughout that integer overflow interrupts have been masked off, so they cannot occur. C, 
Fortran, and even Java programs run in this environment, but Pascal and Ada users beware!


The presentation is informal. Proofs are given only when the algorithm is not obvious, and sometimes not even 
then. The methods use computer arithmetic, "floor" functions, mixtures of arithmetic and logical operations, 
and so on. Proofs in this domain are often difficult and awkward to express. 


To reduce typographical errors and oversights, many of the algorithms have been executed. This is why they 
are given in a real programming language, even though, like every computer language, it has some ugly 
features. C is used for the high-level language because it is widely known, it allows the straightforward mixture 
of integer and bit-string operations, and C compilers that produce high-quality object code are available. 


Occasionally, machine language is used. It employs a three-address format, mainly for ease of readability. The 
assembly language used is that of a fictitious machine that is representative of today's RISC computers. 


Branch-free code is favored. This is because on many computers, branches slow down instruction fetching and 
inhibit executing instructions in parallel. Another problem with branches is that they may inhibit compiler 
optimizations such as instruction scheduling, commoning, and register allocation. That is, the compiler may be 
more effective at these optimizations with a program that consists of a few large basic blocks rather than many 
small ones. 


The code sequences also tend to favor small immediate values, comparisons to zero (rather than to some other 
number), and instruction-level parallelism. Although much of the code would become more concise by using 
table lookups (from memory), this is not often mentioned. This is because loads are becoming more expensive 
relative to arithmetic instructions, and the table lookup methods are often not very interesting (although they 
are often practical). But there are exceptional cases. 


Finally, I should mention that the term "hacker" in the title is meant in the original sense of an aficionado of 
computers—someone who enjoys making computers do new things, or do old things in a new and clever way. 
The hacker is usually quite good at his craft, but may very well not be a professional computer programmer or 
designer. The hacker's work may be useful or may be just a game. As an example of the latter, more than one
determined hacker has written a program which, when executed, writes out an exact copy of itself.[1] This is
the sense in which we use the term "hacker." If you're looking for tips on how to break into someone else's
computer, you won't find them here.


[1] The shortest such program written in C, known to the present author, is by Vlad Taeerov and Rashit Fakhreyev and 
is 64 characters in length: 


main(a){printf(a,34,a="main(a){printf(a,34,a=%c%s%c,34);}",34);} 
Chapter 1. Introduction


Notation 


Instruction Set and Execution Time Model 


1-1 Notation 


This book distinguishes between mathematical expressions of ordinary arithmetic and those that describe the 
operation of a computer. In "computer arithmetic," operands are bit strings, or bit vectors, of some definite 
fixed length. Expressions in computer arithmetic are similar to those of ordinary arithmetic, but the variables 
denote the contents of computer registers. The value of a computer arithmetic expression is simply a string of 
bits with no particular interpretation. An operator, however, interprets its operands in some particular way. For 
example, a comparison operator might interpret its operands as signed binary integers or as unsigned binary 
integers; our computer arithmetic notation uses distinct symbols to make the type of comparison clear. 


The main difference between computer arithmetic and ordinary arithmetic is that in computer arithmetic, the 
results of addition, subtraction, and multiplication are reduced modulo $2^n$, where n is the word size of the
machine. Another difference is that computer arithmetic includes a large number of operations. In addition to 
the four basic arithmetic operations, computer arithmetic includes logical and, exclusive or, compare, shift left, 
and so on. 


Unless specified otherwise, the word size is 32 bits, and signed integers are represented in two's-complement 
form. 


Expressions of computer arithmetic are written similarly to those of ordinary arithmetic, except that the 
variables that denote the contents of computer registers are in bold-face type. This convention is commonly 
used in vector algebra. We regard a computer word as a vector of single bits. Constants also appear in bold-face 
type when they denote the contents of a computer register. (This has no analogy with vector algebra because in 
vector algebra the only way to write a constant is to display the vector's components.) When a constant denotes 
part of an instruction, such as the immediate field of a shift instruction, light-face type is used. 


If an operator such as "+" has bold-face operands, then that operator denotes the computer's addition operation 
("vector addition"). If the operands are light-faced, then the operator denotes the ordinary scalar arithmetic 
operation. We use a light-faced variable $x$ to denote the arithmetic value of a bold-faced variable $\mathbf{x}$ under an
interpretation (signed or unsigned) that should be clear from the context. Thus, if $\mathbf{x}$ = 0x80000000 and $\mathbf{y}$ =
0x80000000, then, under signed integer interpretation, $x = y = -2^{31}$, $x + y = -2^{32}$, and $\mathbf{x} + \mathbf{y} = 0$. Here,
0x80000000 is hexadecimal notation for a bit string consisting of a 1-bit followed by 31 0-bits.
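The example can be checked directly in C. The following is only a sketch: it assumes that uint32_t stands in for a 32-bit register and that int64_t is wide enough to play the role of ordinary arithmetic. The function names word_add and signed_value are ours, chosen for illustration.

```c
#include <stdint.h>

/* Computer addition on 32-bit words: the sum is reduced modulo 2^32. */
uint32_t word_add(uint32_t x, uint32_t y) {
    return x + y;   /* the result is truncated to 32 bits */
}

/* Arithmetic value of a word under the signed (two's-complement)
   interpretation, computed in a wider type so nothing wraps. */
int64_t signed_value(uint32_t x) {
    return (x & 0x80000000u) ? (int64_t)x - 4294967296LL : (int64_t)x;
}
```

With both operands equal to 0x80000000, signed_value gives -2^31 for each, their ordinary sum is -2^32, and word_add returns 0.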


Bits are numbered from the right, with the rightmost (least significant) bit being bit 0. The terms "bits," 
"nibbles," "bytes," "halfwords," "words," and "doublewords" refer to lengths of 1, 4, 8, 16, 32, and 64 bits, 
respectively. 


Short and simple sections of code are written in computer algebra, using its assignment operator (left arrow) 
and occasionally an if statement. In this role, computer algebra is serving as little more than a machine- 
independent way of writing assembly language code. 


Longer or more complex computer programs are written in the C++ programming language. None of the
object-oriented features of C++ are used; the programs are basically in C with comments in C++ style. When the
distinction is unimportant, the language is referred to simply as "C."


A complete description of C would be out of place in this book, but Table 1-1 contains a brief summary of most
of the elements of C [H&S] that are used herein. This is provided for the benefit of the reader who is familiar
with some procedural programming language but not with C. Table 1-1 also shows the operators of our
computer-algebraic arithmetic language. Operators are listed from highest precedence (tightest binding) to
lowest. In the Precedence column, L means left-associative; that is,


a*b*c=(a*b)*c 


and R means right-associative. Our computer-algebraic notation follows C in precedence and associativity. 


In addition to the notations described in Table 1-1, those of Boolean algebra and of standard mathematics are 
used, with explanations where necessary. 


Table 1-1. Expressions of C and Computer Algebra 


Precedence  C                               Computer Algebra                Description
            0x...                           0x..., 0b...                    Hexadecimal, binary constants
            a[k]                                                            Selecting the kth component
                                            $x_0, x_1, \ldots$              Different variables, or bit selection (clarified in text)
            f(x, ...)                       f(x, ...)                       Function evaluation
                                            abs(x)                          Absolute value (but abs($-2^{31}$) = $-2^{31}$)
                                            nabs(x)                         Negative of the absolute value
            x++, x--                                                        Postincrement, decrement
            ++x, --x                                                        Preincrement, decrement
                                            $x^k$                           x to the kth power
            (type name)x                                                    Type conversion
            ~x                              ~x                              Bitwise not (one's-complement)
            !x                                                              Logical not (if x = 0 then 1 else 0)
            -x                              -x                              Arithmetic negation
13 L        x*y                             x * y                           Multiplication, modulo word size
13 L        x/y                             x ÷ y                           Signed integer division
13 L        x/y                             x ÷u y                          Unsigned integer division
13 L        x%y                             rem(x, y)                       Remainder (may be negative), of (x ÷ y), signed arguments
13 L        x%y                             remu(x, y)                      Remainder of x ÷u y, unsigned arguments
                                            mod(x, y)                       x reduced modulo y to the interval [0, abs(y) - 1]; signed arguments
12 L        x + y, x - y                    x + y, x - y                    Addition, subtraction
11 L        x << y, x >> y                  x << y, x >>u y                 Shift left, right with 0-fill ("logical" shifts)
11 L        x >> y                          x >>s y                         Shift right with sign-fill ("arithmetic" or "algebraic" shift)
11 L                                        x <<rot y, x >>rot y            Rotate shift left, right
10 L        x < y, x <= y, x > y, x >= y    x < y, x ≤ y, x > y, x ≥ y      Signed comparison
10 L        x < y, x <= y, x > y, x >= y    x <u y, x ≤u y, x >u y, x ≥u y  Unsigned comparison
9 L         x == y, x != y                  x = y, x ≠ y                    Equality, inequality
8 L         x & y                           x & y                           Bitwise and
7 L         x ^ y                           x ⊕ y                           Bitwise exclusive or
7 L                                         x ≡ y                           Bitwise equivalence (~(x ⊕ y))
6 L         x | y                           x | y                           Bitwise or
5 L         x && y                                                          Conditional and (if x = 0 then 0 else if y = 0 then 0 else 1)
4 L         x || y                                                          Conditional or (if x ≠ 0 then 1 else if y ≠ 0 then 1 else 0)
3 L                                         x || y                          Concatenation
2 R         x = y                           x ← y                           Assignment


Our computer algebra uses other functions, in addition to "abs," "rem," and so on. These are defined where 
introduced. 


In C, the expression x < y < z means to evaluate x < y to a 0/1-valued result, and then compare that result to
z. In computer algebra, the expression x < y < z means (x < y) & (y < z).
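The difference between the two readings is easy to see in a small sketch. The function names below are ours, and the parentheses in c_style_chain only make explicit what C does anyway with x < y < z.

```c
/* What C computes for x < y < z: (x < y) yields 0 or 1,
   and that 0/1 value is then compared with z. */
int c_style_chain(int x, int y, int z) {
    return (x < y) < z;
}

/* The computer-algebra meaning: both comparisons must hold. */
int algebra_chain(int x, int y, int z) {
    return (x < y) & (y < z);
}
```

For x = 3, y = 2, z = 1 the C reading gives 1 (because 0 < 1), whereas the computer-algebra reading gives 0.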


C has three loop control statements: while, do, and for. The while statement is written: 


while (expression) statement 


First, expression is evaluated. If true (nonzero), statement is executed and control returns to evaluate 
expression again. If expression is false (0), the while-loop terminates. 


The do statement is similar, except the test is at the bottom of the loop. It is written:


do statement while (expression);


First, statement is executed, and then expression is evaluated. If true, the process is repeated, and if false, the 
loop terminates. 


The for statement is written:


for (e1; e2; e3) statement


First, e1, usually an assignment statement, is executed. Then e2, usually a comparison, is evaluated. If false, the
for-loop terminates. If true, statement is executed. Finally, e3, usually an assignment statement, is executed,
and control returns to evaluate e2 again. Thus, the familiar "do i = 1 to n" is written:


for (i = 1; i <= n; i++)


(This is one of the few contexts in which we use the postincrement operator.) 
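As a small illustration of the three statements, here is the sum 1 + 2 + ... + n written each way. This is only a sketch with function names of our own choosing. Note that the do version executes its body once even when n is 0.

```c
/* Test at the top: the body is skipped entirely when n < 1. */
int sum_while(int n) {
    int i = 1, s = 0;
    while (i <= n) {
        s += i;
        i++;
    }
    return s;
}

/* Test at the bottom: the body always runs at least once. */
int sum_do(int n) {
    int i = 1, s = 0;
    do {
        s += i;
        i++;
    } while (i <= n);
    return s;
}

/* The "do i = 1 to n" idiom. */
int sum_for(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}
```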


1-2 Instruction Set and Execution Time Model 


To permit a rough comparison of algorithms, we imagine them being coded for a machine with an instruction 
set similar to that of today's general purpose RISC computers, such as the Compaq Alpha, the SGI MIPS, and 
the IBM RS/6000. The machine is three-address and has a fairly large number of general purpose registers— 
that is, 16 or more. Unless otherwise specified, the registers are 32 bits long. General register 0 contains a 
permanent 0, and the others can be used uniformly for any purpose. 


In the interest of simplicity there are no "special purpose" registers, such as a condition register or a register to 
hold status bits, such as "overflow." No floating-point operations are described, because that is beyond the 
scope of this book. 


We recognize two varieties of RISC: a "basic RISC," having the instructions shown in Table 1-2, and a "full 
RISC," having all the instructions of the basic RISC plus those shown in Table 1-3. 


Table 1-2. Basic RISC Instruction Set 


Opcode Mnemonic                 Operands    Description
add, sub, mul, div, divu,       RT,RA,RB    RT ← RA op RB, where op is add, subtract, multiply,
rem, remu                                   divide signed, divide unsigned, remainder signed, or
                                            remainder unsigned.
addi, muli                      RT,RA,I     RT ← RA op I, where op is add or multiply, and I is a
                                            16-bit signed immediate value.
addis                           RT,RA,I     RT ← RA + (I || 0x0000).
and, or, xor                    RT,RA,RB    RT ← RA op RB, where op is bitwise and, or, or
                                            exclusive or.
andi, ori, xori                 RT,RA,Iu    As above, except the last operand is a 16-bit
                                            unsigned immediate value.
beq, bne, blt, ble, bgt, bge    RT,target   Branch to target if RT = 0, or if RT ≠ 0, or if
                                            RT < 0, or if RT ≤ 0, or if RT > 0, or if RT ≥ 0
                                            (signed integer interpretation of RT).
cmpeq, cmpne, cmplt, cmple,     RT,RA,RB    RT gets the result of comparing RA with RB; 0 if
cmpgt, cmpge, cmpltu, cmpleu,               false and 1 if true. Mnemonics denote compare for
cmpgtu, cmpgeu                              equality, inequality, less than, and so on, as for
                                            the branch instructions, and in addition, the
                                            suffix "u" denotes an unsigned comparison.
cmpieq, cmpine, cmpilt,         RT,RA,I     Like cmpeq, and so on, except the second comparand
cmpile, cmpigt, cmpige                      is a 16-bit signed immediate value.
cmpiequ, cmpineu, cmpiltu,      RT,RA,Iu    Like cmpltu, and so on, except the second comparand
cmpileu, cmpigtu, cmpigeu                   is a 16-bit unsigned immediate value.
ldbu, ldh, ldhu, ldw            RT,d(RA)    Load an unsigned byte, signed halfword, unsigned
                                            halfword, or word into RT from memory at location
                                            RA + d, where d is a 16-bit signed immediate value.
shl, shr, shrs                  RT,RA,RB    RT ← RA shifted left or right by the amount given
                                            in the rightmost six bits of RB; 0-fill except for
                                            shrs, which is sign-fill. (The shift amount is
                                            treated modulo 64.)
shli, shri, shrsi               RT,RA,Iu    RT ← RA shifted left or right by the amount given
                                            in the 5-bit immediate field.
stb, sth, stw                   RS,d(RA)    Store a byte, halfword, or word, from RS into
                                            memory at location RA + d, where d is a 16-bit
                                            signed immediate value.


In these brief instruction descriptions, RA and RB appearing as source operands really means the contents of 
those registers. 


A real machine would have branch and link (for subroutine calls), branch to the address contained in a register 
(for subroutine returns and "switches"), and possibly some instructions for dealing with special purpose 
registers. It would, of course, have a number of privileged instructions and instructions for calling on supervisor 
services. It might also have floating-point instructions. 


Some other computational instructions that a RISC computer might have are identified in Table 1-3. These are 
discussed in later chapters. 


Table 1-3. Additional Instructions for the "Full RISC" 


Opcode Mnemonic                 Operands    Description
abs, nabs                       RT,RA       RT gets the absolute value, or the negative of the
                                            absolute value, of RA.
andc, eqv, nand, nor, orc       RT,RA,RB    Bitwise and with complement (of RB), equivalence,
                                            negative and, negative or, and or with complement.
extr                            RT,RA,I,L   Extract bits I through I+L-1 of RA, and place them
                                            right-adjusted in RT, with 0-fill.
nlz                             RT,RA       RT gets the number of leading 0's in RA (0 to 32).
pop                             RT,RA       RT gets the number of 1-bits in RA (0 to 32).
ldb                             RT,d(RA)    Load a signed byte into RT from memory at location
                                            RA + d, where d is a 16-bit signed immediate value.
moveq, movne, movlt, movle,     RT,RA,RB    RT ← RB if RA = 0, or if RA ≠ 0, and so on, else RT
movgt, movge                                is unchanged.
shlr, shrr                      RT,RA,RB    RT ← RA rotate-shifted left or right by the amount
                                            given in the rightmost five bits of RB.
shlri, shrri                    RT,RA,Iu    RT ← RA rotate-shifted left or right by the amount
                                            given in the 5-bit immediate field.
trpeq, trpne, trplt, trple,     RA,RB       Trap (interrupt) if RA = RB, or RA ≠ RB, and so on.
trpgt, trpge, trpltu, trpleu,
trpgtu, trpgeu
trpieq, trpine, trpilt,         RA,I        Like trpeq, and so on, except the second comparand
trpile, trpigt, trpige                      is a 16-bit signed immediate value.
trpiequ, trpineu, trpiltu,      RA,Iu       Like trpltu, and so on, except the second comparand
trpileu, trpigtu, trpigeu                   is a 16-bit unsigned immediate value.


It is convenient to provide the machine's assembler with a few "extended mnemonics." These are like macros 
whose expansion is usually a single instruction. Some possibilities are shown in Table 1-4. 


Table 1-4. Extended Mnemonics 


Extended Mnemonic    Expansion          Description
b target             beq R0,target      Unconditional branch.
li RT,I              See text           Load immediate, $-2^{31} \le I < 2^{32}$.
mov RT,RA            ori RT,RA,0        Move register RA to RT.
neg RT,RA            sub RT,R0,RA       Negate (two's-complement).
subi RT,RA,I         addi RT,RA,-I      Subtract immediate ($I \ne -2^{15}$).


The load immediate instruction expands into one or two instructions, as required by the immediate value I. For
example, if $0 \le I < 2^{16}$, an or immediate (ori) from R0 can be used. If $-2^{15} \le I < 0$, an add immediate
(addi) from R0 can be used. If the rightmost 16 bits of I are 0, add immediate shifted (addis) can be used.
Otherwise, two instructions are required, such as addis followed by ori. (Alternatively, in the last case a
load from memory could be used, but for execution time and space estimates we assume that two elementary
arithmetic instructions are used.)
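The case analysis above can be written down as a short C function. This is a sketch of how an assembler for the fictitious machine might count the instructions that li expands into. The name li_instruction_count is hypothetical, and the immediate is held in an int64_t only so that the whole range from -2^31 up to (but not including) 2^32 fits.

```c
#include <stdint.h>

/* Number of elementary instructions needed to load the immediate imm,
   assuming -2^31 <= imm < 2^32 (hypothetical helper, not from the text). */
int li_instruction_count(int64_t imm) {
    if (imm >= 0 && imm < 65536)
        return 1;                      /* ori  RT,R0,imm          */
    if (imm >= -32768 && imm < 0)
        return 1;                      /* addi RT,R0,imm          */
    if (((uint64_t)imm & 0xFFFFu) == 0)
        return 1;                      /* addis: low 16 bits are 0 */
    return 2;                          /* addis followed by ori   */
}
```

For instance, 0x10000 and 0x80000000 need one instruction each (addis), while 0x12345678 needs two.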


Of course, which instructions belong in the basic RISC, and which belong in the full RISC is very much a 
matter of judgment. Quite possibly, divide unsigned and the remainder instructions should be moved to the full 
RISC category. Shift right signed is another suspicious instruction, given its low frequency of use in the SPEC 
benchmarks. The trouble is, in C it is easy to accidentally use these instructions, by doing a division with 
unsigned operands when they could just as well be signed, and by doing a shift right with a signed quantity 
(int) that could just as well be unsigned. Incidentally, shift right signed (or shift right arithmetic, as it is often
called) does not do a division of a signed integer by a power of 2; you need to add 1 to the result if the dividend
is negative and any nonzero bits are shifted out.
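The correction just described can be sketched in C as follows. The sketch assumes that >> applied to a negative int is an arithmetic (sign-filling) shift. C leaves that behavior implementation-defined, although common compilers behave this way. The function name is ours, and k is assumed to lie between 0 and 30.

```c
/* Divide x by 2^k, rounding toward zero as the C "/" operator does.
   Assumes >> on a negative int sign-fills (implementation-defined in C). */
int div_by_pow2(int x, int k) {
    int q = x >> k;                        /* rounds toward minus infinity */
    int shifted_out = x & ((1 << k) - 1);  /* the bits discarded by the shift */
    if (x < 0 && shifted_out != 0)
        q += 1;                            /* adjust a negative, inexact quotient */
    return q;
}
```

For example, under that assumption -7 >> 1 is -4, whereas -7/2 is -3; the adjustment supplies the missing 1. No adjustment is made for -8 with k = 2, because no nonzero bits are shifted out.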


The distinction between basic and full RISC involves many other such questionable judgments, but we won't 
dwell on them. 


The instructions are limited to two source registers and one target, which simplifies the computer (e.g., the 
register file requires no more than two read ports and one write port). It also simplifies an optimizing compiler, 
because the compiler does not need to deal with instructions that have multiple targets. The price paid for this is 
that a program that wants both the quotient and remainder of two numbers (not uncommon) must execute two 
instructions (divide and remainder). The usual machine division algorithm produces the remainder as a by- 
product, so many machines make them both available as a result of one execution of divide. Similar remarks 
apply to obtaining the doubleword product of two words. 


The conditional move instructions (e.g., moveq) ostensibly have only two source operands, but in a sense they
have three. Because the result of the instruction depends on the values in RT, RA, and RB, a machine that
executes instructions out of order must treat RT in these instructions as both a use and a set. That is, an 
instruction that sets RT, followed by a conditional move that sets RT, must be executed in that order, and the 
result of the first instruction cannot be discarded. Thus, the designer of such a machine may elect to omit the 
conditional move instructions to avoid having to consider an instruction with (logically) three source operands. 
On the other hand, the conditional move instructions do save branches. 
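Modeling the instructions as C functions makes the third operand visible: the old value of RT must be passed in, because it is the result when the condition fails. The following is only an illustrative model, not a definition of the instructions. The function max_unsigned is a hypothetical example of our own showing how a compare and a conditional move might replace a branch.

```c
#include <stdint.h>

/* moveq RT,RA,RB: the new RT is RB if RA = 0, else the old RT. */
uint32_t moveq(uint32_t rt, uint32_t ra, uint32_t rb) {
    return (ra == 0) ? rb : rt;
}

/* movne RT,RA,RB: the new RT is RB if RA is nonzero, else the old RT. */
uint32_t movne(uint32_t rt, uint32_t ra, uint32_t rb) {
    return (ra != 0) ? rb : rt;
}

/* Unsigned maximum as a compare followed by a conditional move. */
uint32_t max_unsigned(uint32_t a, uint32_t b) {
    uint32_t t = (a < b);      /* models cmpltu t,a,b */
    uint32_t r = a;            /* models mov r,a      */
    return movne(r, t, b);     /* models movne r,t,b  */
}
```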


Instruction formats are not relevant to the purposes of this book, but the full RISC instruction set described 
above, with floating point and a few supervisory instructions added, can be implemented with 32-bit 
instructions on a machine with 32 general purpose registers (5-bit register fields). By reducing the immediate 
fields of compare, load, store, and trap instructions to 14 bits, the same holds for a machine with 64 general 
purpose registers (6-bit register fields). 


Execution Time 


We assume that all instructions execute in one cycle, except for the multiply, divide, and remainder 
instructions, for which we do not assume any particular execution time. Branches take one cycle whether they 
branch or fall through. 


The load immediate instruction is counted as one or two cycles, depending on whether one or two elementary 
arithmetic instructions are required to generate the constant in a register. 


Although load and store instructions are not often used in this book, we assume they take one cycle and ignore 
any load delay (time lapse between when a load instruction completes in the arithmetic unit, and when the 
requested data is available for a subsequent instruction). 


However, knowing the number of cycles used by all the arithmetic and logical instructions is often insufficient 
for estimating the execution time of a program. Execution can be slowed substantially by load delays and by 
delays in fetching instructions. These delays, although very important and increasing in importance, are not 
discussed in this book. Another factor, one which improves execution time, is what is called "instruction-level 
parallelism," which is found in many contemporary RISC chips, particularly those for "high-end" machines. 


These machines have multiple execution units and sufficient instruction-dispatching capability to execute 
instructions in parallel when they are independent (that is, when neither uses a result of the other, and they don't 
both set the same register or status bit). Because this capability is now quite common, the presence of 
independent operations is often pointed out in this book. Thus, we might say that such and such a formula can 
be coded in such a way that it requires eight instructions and executes in five cycles on a machine with 
unlimited instruction-level parallelism. This means that if the instructions are arranged in the proper order 
("scheduled"), a machine with a sufficient number of adders, shifters, logical units, and registers can in 
principle execute the code in five cycles. 


We do not make too much of this, because machines differ greatly in their instruction-level parallelism 
capabilities. For example, an IBM RS/6000 processor from ca. 1992 has a three-input adder, and can execute 
two consecutive add-type instructions in parallel even when one feeds the other (e.g., an add feeding a 
compare, or the base register of a load). As a contrary example, consider a simple computer, possibly for low- 
cost embedded applications, that has only one read port on its register file. Normally, this machine would take 
an extra cycle to do a second read of the register file for an instruction that has two register input operands. 
However, suppose it has a bypass so that if an instruction feeds an operand of the immediately following 
instruction, then that operand is available without reading the register file. On such a machine, it is actually 
advantageous if each instruction feeds the next—that is, if the code has no parallelism. 


Chapter 2. Basics 


Manipulating Rightmost Bits 
Addition Combined with Logical Operations 


Inequalities among Logical and Arithmetic Expressions 


Absolute Value Function 


Sign Extension 


Shift Right Signed from Unsigned 


Sign Function 


Three-Valued Compare Function 
Transfer of Sign 
Decoding a "Zero Means 2**n" Field


Comparison Predicates 


Overflow Detection 


Condition Code Result of Add, Subtract, and Multiply


Rotate Shifts 


Double-Length Add/Subtract 


Double-Length Shifts 


Multibyte Add, Subtract, Absolute Value


Doz, Max, Min 


Exchanging Registers 


Alternating among Two or More Values 


2-1 Manipulating Rightmost Bits 


Some of the formulas in this section find application in later chapters. 


Use the following formula to turn off the rightmost 1-bit in a word, producing 0 if none (e.g., 01011000
⇒ 01010000):


x & (x - 1)


This may be used to determine if an unsigned integer is a power of 2 or is 0; apply the formula followed by a
0-test on the result.


Similarly, the following formula can be used to test if an unsigned integer is of the form $2^n - 1$ (including 0 or
all 1's):


x & (x + 1)
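The two tests can be sketched in C as follows, assuming uint32_t models a 32-bit word. The function names are ours. Because the bare 0-test also accepts x = 0, a strict power-of-2 test needs the extra comparison shown in is_pow2.

```c
#include <stdint.h>

/* x & (x - 1) turns off the rightmost 1-bit, so the result is 0
   exactly when x has at most one 1-bit. */
int is_pow2_or_zero(uint32_t x) {
    return (x & (x - 1)) == 0;
}

/* Strict power-of-2 test: additionally rule out 0. */
int is_pow2(uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* True for 0, 1, 3, 7, ..., 0xFFFFFFFF (the form 2^n - 1). */
int is_pow2_minus_1(uint32_t x) {
    return (x & (x + 1)) == 0;
}
```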


Use the following formula to isolate the rightmost 1-bit, producing 0 if none (e.g., 01011000 ⇒ 00001000):


x & (-x)


Use the following formula to isolate the rightmost 0-bit, producing 0 if none (e.g., 10100111 ⇒ 00001000):


~x & (x + 1)
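In C, the two isolation formulas might be written as below. This is a sketch with uint32_t as the word type and names of our own choosing. The negation is written as 0u - x, which for unsigned operands is the two's-complement negation modulo 2^32.

```c
#include <stdint.h>

/* Isolate the rightmost 1-bit: x & (-x). Yields 0 if x is 0. */
uint32_t isolate_rightmost_one(uint32_t x) {
    return x & (0u - x);
}

/* Isolate the rightmost 0-bit: ~x & (x + 1). Yields 0 if x is all 1's. */
uint32_t isolate_rightmost_zero(uint32_t x) {
    return ~x & (x + 1);
}
```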


Use one of the following formulas to form a mask that identifies the trailing 0's, producing all 1's if x = 0 (e.g.,
01011000 ⇒ 00000111):


~x & (x - 1), or
~(x | -x), or
(x & -x) - 1


The first formula has some instruction-level parallelism. 
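The three formulas can be compared in a C sketch (function names ours, uint32_t as the word). In the first, the complement and the decrement do not depend on each other, so a machine with enough execution units could do them in the same cycle. In the other two, each operation needs the result of the one before it.

```c
#include <stdint.h>

/* ~x and x - 1 are independent; only the final and waits for both. */
uint32_t trailing_zeros_mask_a(uint32_t x) {
    return ~x & (x - 1);
}

/* negate, then or, then not: a chain of three dependent operations. */
uint32_t trailing_zeros_mask_b(uint32_t x) {
    return ~(x | (0u - x));
}

/* negate, then and, then subtract 1: also a dependent chain. */
uint32_t trailing_zeros_mask_c(uint32_t x) {
    return (x & (0u - x)) - 1;
}
```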


Use the following formula to form a mask that identifies the rightmost 1-bit and the trailing 0's, producing all
1's if x = 0 (e.g., 01011000 ⇒ 00001111):


x ⊕ (x - 1)


Use the following formula to right-propagate the rightmost 1-bit, producing all 1's if x = 0 (e.g., 01011000
⇒ 01011111):


x | (x - 1)


Use the following formula to turn off the rightmost contiguous string of 1-bits (e.g., 01011000 ⇒ 01000000):


((x | (x - 1)) + 1) & x


This may be used to see if a nonnegative integer is of the form $2^j - 2^k$ for some $j \ge k \ge 0$; apply the formula
followed by a 0-test of the result.
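The last three formulas, and the test built on the third, might look like this in C. Again this is a sketch with uint32_t as the word and names of our own choosing.

```c
#include <stdint.h>

/* Mask of the rightmost 1-bit and the trailing 0's: x xor (x - 1). */
uint32_t mask_through_rightmost_one(uint32_t x) {
    return x ^ (x - 1);
}

/* Right-propagate the rightmost 1-bit: x | (x - 1). */
uint32_t propagate_rightmost_one(uint32_t x) {
    return x | (x - 1);
}

/* Turn off the rightmost contiguous string of 1-bits. */
uint32_t clear_rightmost_run(uint32_t x) {
    return ((x | (x - 1)) + 1) & x;
}

/* True if x consists of at most one contiguous run of 1-bits,
   that is, x = 2^j - 2^k for some j >= k >= 0. */
int is_pow2_difference(uint32_t x) {
    return clear_rightmost_run(x) == 0;
}
```

For example, 0x38 (00111000) is 2^6 - 2^3 and passes the test, whereas 0x58 (01011000) has two separate runs of 1-bits and does not.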


These formulas all have duals in the following sense. Read what the formula does, interchanging 1's and 0's in 
the description. Then, in the formula, replace x - 1 with x + 1, x + 1 with x - 1, -x with ~(x + 1), & with |, and | 
with &. Leave x and ~x alone. Then the result is a valid description and formula. For example, the dual of the 
first formula in this section reads as follows: 


Use the following formula to turn on the rightmost 0-bit in a word, producing all 1's if none (e.g., 10100111
⇒ 10101111):


x | (x + 1)
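One way to convince oneself of the duality is to observe that a dual formula applied to x gives the complement of the original formula applied to the complement of x. The sketch below checks this for the first formula of the section; the function names are ours.

```c
#include <stdint.h>

/* The original formula: turn off the rightmost 1-bit. */
uint32_t turn_off_rightmost_one(uint32_t x) {
    return x & (x - 1);
}

/* Its dual: turn on the rightmost 0-bit. */
uint32_t turn_on_rightmost_zero(uint32_t x) {
    return x | (x + 1);
}

/* The dual computed indirectly, by complementing before and after. */
uint32_t dual_via_complement(uint32_t x) {
    return ~turn_off_rightmost_one(~x);
}
```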


There is a simple test to determine whether or not a given function can be implemented with a sequence of 
add's, subtract's, and's, or's, and not's [War]. We may, of course, expand the list with other instructions that can 
be composed from the basic list, such as shift left by a fixed amount (which is equivalent to a sequence of 
add's), or multiply. However, we exclude instructions that cannot be composed from the list. The test is 
contained in the following theorem. 


Theorem. A function mapping words to words can be implemented with word-parallel add, subtract, and, or, 
and not instructions if and only if each bit of the result depends only on bits at and to the right of each input 
operand. 


That is, imagine trying to compute the rightmost bit of the result by looking only at the rightmost bit of each 
input operand. Then, try to compute the next bit to the left by looking only at the rightmost two bits of each 
input operand, and so forth. If you are successful in this, then the function can be computed with a sequence of 
add's, and's, and so on. If the function cannot be computed in this right-to-left manner, then it cannot be 
implemented with a sequence of such instructions. 


The interesting part of this is the latter statement, and it is simply the contrapositive of the observation that the 
functions add, subtract, and, or, and not can all be computed in the right-to-left manner, so any combination of 
them must have this property. 


To see the "if" part of the theorem, we need a construction that is a little awkward to explain. We illustrate it 
with a specific example. Suppose that a function of two variables x and y has the right-to-left computability 
property, and suppose that bit 2 of the result r is given by 


Equation 1 


$$r_2 = x_2 \mid (x_0 \mathbin{\&} y_1).$$


We number bits from right to left, 0 to 31. Because bit 2 of the result is a function of bits at and to the right of 
bit 2 of the input operands, bit 2 of the result is "right-to-left computable." 


Arrange the computer words x, x shifted left two, and y shifted left one, as shown below. Also, add a mask that 
isolates bit 2. 


x_31  x_30  ...  x_3  x_2  x_1  x_0 
x_29  x_28  ...  x_1  x_0  0    0 
y_30  y_29  ...  y_2  y_1  y_0  0 
0     0     ...  0    1    0    0 
0     0     ...  0    r_2  0    0 


Now, form the word-parallel and of lines 2 and 3, or the result with row 1 (following Equation (1)), and and the 
result with the mask (row 4 above). The result is a word of all 0's except for the desired result bit in position 2. 
Perform similar computations for the other bits of the result, or the 32 resulting words together, and the result is 
the desired function. 
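As a minimal sketch of the construction for this single bit, assuming 32-bit words and the example relation of Equation (1), the word-parallel steps might be written in C as follows. The function name result_bit2 is hypothetical and is introduced only for illustration.

```c
#include <stdint.h>

/* Isolate bit 2 of the result using only shifts by constants, and, and or. */
uint32_t result_bit2(uint32_t x, uint32_t y) {
    uint32_t row1 = x;          /* x_2 already sits in position 2 */
    uint32_t row2 = x << 2;     /* x_0 moved to position 2        */
    uint32_t row3 = y << 1;     /* y_1 moved to position 2        */
    uint32_t mask = 1u << 2;    /* selects position 2 only        */
    return (row1 | (row2 & row3)) & mask;
}
```

A full implementation of the function would repeat this for each of the 32 result bits and or the resulting words together.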


This construction does not yield an efficient program; rather, it merely shows that it can be done with 
instructions in the basic list. 


Using the theorem, we immediately see that there is no sequence of such instructions that turns off the leftmost 
1-bit in a word, because to see if a certain 1-bit should be turned off, we must look to the left to see if it is the 
leftmost one. Similarly, there can be no such sequence for performing a right shift, or a rotate shift, or a left 
shift by a variable amount, or for counting the number of trailing 0's in a word (to count trailing 0's, the 
rightmost bit of the result will be 1 if there are an odd number of trailing 0's, and we must look to the left of the 
rightmost position to determine that). 


A novel application of the sort of bit twiddling discussed above is the problem of finding the next higher 
number after a given number that has the same number of 1-bits. You are forgiven if you are asking, "Why on 
earth would anyone want to compute that?" It has application where bit strings are used to represent subsets. 
The possible members of a set are listed in a linear array, and a subset is represented by a word or sequence of 
words in which bit i is on if member i is in the subset. Set unions are computed by the logical or of the bit 
strings, intersections by and's, and so on. 


You might want to iterate through all the subsets of a given size. This is easily done if you have a function that 
maps a given subset to the next higher number (interpreting the subset string as an integer) with the same 
number of 1-bits. 


A concise algorithm for this operation was devised by R. W. Gosper [HAK, item 175]. [1] Given a word x that 
represents a subset, the idea is to find the rightmost contiguous group of 1's in x and the following 0's, and 
"increment" that quantity to the next value that has the same number of 1's. For example, the string xxx0 1111 
0000, where xxx represents arbitrary bits, becomes xxx1 0000 0111. The algorithm first identifies the 
"smallest" 1-bit in x, with s = x & -x, giving 0000 0001 0000. This is added to x, giving r = xxx1 0000 0000. The 
1-bit here is one bit of the result. For the other bits, we need to produce a right-adjusted string of n - 1 1's, 
where n is the size of the rightmost group of 1's in x. This can be done by first forming the exclusive or of r and 
x, which gives 0001 1111 0000 in our example. This has two too many 1's, and needs to be right-adjusted. This 
can be accomplished by dividing it by s, which right-adjusts it (s is a power of 2), and shifting it right two 
more positions to discard the two unwanted bits. The final result is the or of this and r. 


[1] A variation of this algorithm appears in [H&S] sec. 7.6.7. 


In computer algebra notation, the result is y in 


Equation 2 


s ← x & -x 
r ← s + x 
y ← r | (((x ⊕ r) >>u 2) ÷u s) 


Here >>u denotes shift right unsigned and ÷u denotes unsigned division. 


A complete C procedure is given in Figure 2-1. It executes in seven basic RISC instructions, one of which is 
division. (Do not use this procedure with x = 0; that causes division by 0.) 


Figure 2-1 Next higher number with same number of 1-bits. 


unsigned snoob(unsigned x) { 
   unsigned smallest, ripple, ones; 
                                // x = xxx0 1111 0000 
   smallest = x & -x;           //     0000 0001 0000 
   ripple = x + smallest;       //     xxx1 0000 0000 
   ones = x ^ ripple;           //     0001 1111 0000 
   ones = (ones >> 2)/smallest; //     0000 0000 0111 
   return ripple | ones;        //     xxx1 0000 0111 
} 
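As a usage sketch, the procedure can drive the enumeration of all k-element subsets of an n-element set described earlier. The helper name count_subsets and the restriction 0 < k <= n < 32 are assumptions made for this illustration; snoob is repeated so that the fragment is self-contained.

```c
/* Next higher number with the same number of 1-bits (x must be nonzero). */
unsigned snoob(unsigned x) {
    unsigned smallest = x & -x;
    unsigned ripple = x + smallest;
    unsigned ones = x ^ ripple;
    ones = (ones >> 2) / smallest;
    return ripple | ones;
}

/* Visit every k-element subset of {0, ..., n-1} and return how many were seen.
   Assumes 0 < k <= n < 32, so the limit 1u << n is representable. */
unsigned count_subsets(unsigned n, unsigned k) {
    unsigned limit = 1u << n;
    unsigned count = 0;
    for (unsigned s = (1u << k) - 1; s < limit; s = snoob(s))
        count++;                /* s is the bit string of one subset */
    return count;
}
```

The loop starts from the lowest word with k 1-bits and stops as soon as snoob produces a word with a bit at or above position n.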


If division is slow but you have a fast way to compute the number of trailing zeros function ntz(x), the number 
of leading zeros function nlz(x), or population count (pop(x) is the number of 1-bits in x), then the last line of 
Equation (2) can be replaced with one of the following: 


y ← r | ((x ⊕ r) >>u (2 + ntz(x))) 
y ← r | ((x ⊕ r) >>u (33 - nlz(s))) 
y ← r | ((1 << (pop(x ⊕ r) - 2)) - 1) 
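A sketch of the first of these alternatives, assuming 32-bit words. The helper ntz is written here as a plain loop purely for illustration; on a machine that has the instruction it would be a single operation. The name snoob_ntz is hypothetical.

```c
#include <stdint.h>

/* Number of trailing zeros; returns 32 for x == 0. Illustrative loop only. */
static int ntz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 1) == 0) { x >>= 1; n++; }
    return n;
}

/* Next higher number with the same number of 1-bits, without division.
   x must be nonzero. The shift by 2 + ntz(x) is done in two steps because
   the total can exceed 31, which a single C shift does not allow. */
uint32_t snoob_ntz(uint32_t x) {
    uint32_t r = x + (x & -x);
    return r | (((x ^ r) >> 2) >> ntz(x));
}
```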


2-2 Addition Combined with Logical Operations 


We assume the reader is familiar with the elementary identities of ordinary algebra and Boolean algebra. Below 
is a selection of similar identities involving addition and subtraction combined with logical operations: 


a.        -x = ¬x + 1 
b.           = ¬(x - 1) 
c.        ¬x = -x - 1 
d.       -¬x = x + 1 
e.       ¬-x = x - 1 
f.     x + y = x - ¬y - 1 
g.           = (x ⊕ y) + 2(x & y) 
h.           = (x | y) + (x & y) 
i.           = 2(x | y) - (x ⊕ y) 
j.     x - y = x + ¬y + 1 
k.           = (x ⊕ y) - 2(¬x & y) 
l.           = (x & ¬y) - (¬x & y) 
m.           = 2(x & ¬y) - (x ⊕ y) 
n.     x ⊕ y = (x | y) - (x & y) 
o.    x & ¬y = (x | y) - y 
p.           = x - (x & y) 
q.  ¬(x - y) = y - x - 1 
r.           = ¬x + y 
s.     x ≡ y = (x & y) - (x | y) - 1 
t.           = (x & y) + ¬(x | y) 
u.     x | y = (x & ¬y) + y 
v.     x & y = (¬x | y) - ¬x 


Equation (d) may be applied to itself repeatedly, giving -¬-¬x = x + 2, and so on. Similarly, from (e) we have 
¬-¬-x = x - 2. So we can add or subtract any constant, using only the two forms of complementation. 


Equation (f) is the dual of (j), where (j) is the well-known relation that shows how to build a subtracter from an 
adder. 


Equations (g) and (h) are from HAKMEM memo [HAK, item 23]. Equation (g) forms a sum by first computing 
the sum with carries ignored (x ⊕ y), and then adding in the carries. Equation (h) is simply modifying the 
addition operands so that the combination 0 + 1 never occurs at any bit position; it is replaced with 1 + 0. 


It can be shown that in the ordinary addition of binary numbers with each bit independently equally likely to be 
0 or 1, a carry occurs at each position with probability about 0.5. However, for an adder built by 
preconditioning the inputs using (g), the probability is about 0.25. This observation is probably not of value in 
building an adder, because for that purpose the important characteristic is the maximum number of logic 
circuits the carry must pass through, and using (g) reduces the number of stages the carry propagates through by 
only one. 


Equations (k) and (l) are duals of (g) and (h), for subtraction. That is, (k) has the interpretation of first forming 
the difference ignoring the borrows (x ⊕ y), and then subtracting the borrows. Similarly, Equation (l) is simply 
modifying the subtraction operands so that the combination 1 - 1 never occurs at any bit position; it is replaced 
with 0 - 0. 


Equation (n) shows how to implement exclusive or in only three instructions on a basic RISC. Using only 
and-or-not logic requires four instructions ((x | y) & ¬(x & y)). Similarly, (u) and (v) show how to implement and 
and or in three other elementary instructions, whereas using DeMorgan's laws requires four. 
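A few of these identities can be spot-checked directly in C, where unsigned arithmetic wraps modulo 2**32 as the identities assume. This is only a sketch; the function names are hypothetical, and a 32-bit unsigned int is assumed.

```c
#include <stdint.h>

/* (g): x + y = (x ^ y) + 2(x & y): the sum without carries, plus the carries. */
uint32_t add_g(uint32_t x, uint32_t y) { return (x ^ y) + ((x & y) << 1); }

/* (k): x - y = (x ^ y) - 2(~x & y): the difference without borrows, minus the borrows. */
uint32_t sub_k(uint32_t x, uint32_t y) { return (x ^ y) - ((~x & y) << 1); }

/* (n): exclusive or from or, and, and subtract. */
uint32_t xor_n(uint32_t x, uint32_t y) { return (x | y) - (x & y); }

/* (u) and (v): or and and, each in three elementary instructions. */
uint32_t or_u(uint32_t x, uint32_t y)  { return (x & ~y) + y; }
uint32_t and_v(uint32_t x, uint32_t y) { return (~x | y) - ~x; }
```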


2-3 Inequalities among Logical and Arithmetic Expressions 


Inequalities among binary logical expressions whose values are interpreted as unsigned integers are nearly 
trivial to derive. Here are two examples: 


(x ⊕ y) ≤u (x | y), and 
(x & y) ≤u (x ≡ y), 


where ≤u denotes comparison of unsigned integers. 


These can be derived from a list of all binary logical operations, shown in Table 2-1. 


Let f(x, y) and g(x, y) represent two columns in Table 2-1. If for each row in which f(x, y) is 1, g(x, y) also is 1, 
then for all (x, y), f(x, y) ≤u g(x, y). Clearly, this extends to word-parallel logical operations. One can easily 
read off such relations (most of which are trivial) as (x & y) ≤u x ≤u (x | ¬y), and so on. Furthermore, if two 
columns have a row in which one entry is 0 and the other is 1, and another row in which the entries are 1 and 0, 
respectively, then no inequality relation exists between the corresponding logical expressions. So the question 
of whether or not f(x, y) ≤u g(x, y) is completely and easily solved for all binary logical functions f and g. 


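Table 2-1. The 16 Binary Logical Operations 


x  y  |  0  x&y  x&¬y  x  ¬x&y  y  x⊕y  x|y  ¬(x|y)  x≡y  ¬y  x|¬y  ¬x  ¬x|y  ¬(x&y)  1 
0  0  |  0  0    0     0  0     0  0    0    1       1    1   1     1   1     1       1 
0  1  |  0  0    0     0  1     1  1    1    0       0    0   0     1   1     1       1 
1  0  |  0  0    1     1  0     0  1    1    0       0    1   1     0   0     1       1 
1  1  |  0  1    0     1  0     1  0    1    0       1    0   1     0   1     0       1 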


Use caution when manipulating these relations. For example, for ordinary arithmetic, if x + y ≤ a and z ≤ x, 
then z + y ≤ a. But this inference is not valid if "+" is replaced with or. 


Inequalities involving mixed logical and arithmetic expressions are more interesting. Below is a small selection. 


a. (x | y) ≥u max(x, y) 
b. (x & y) ≤u min(x, y) 
c. (x | y) ≤u x + y, if the addition does not overflow 
d. (x | y) >u x + y, if the addition overflows 
e. |x - y| ≤u (x ⊕ y) 


The proofs of these are quite simple, except possibly for the relation |x - y| ≤u (x ⊕ y). By |x - y| we mean 
the absolute value of x - y, which may be computed within the domain of unsigned numbers as max(x, y) - min 
(x, y). This relation may be proven by induction on the length of x and y (the proof is a little easier if you 
extend them on the left rather than on the right). 


2-4 Absolute Value Function 


If your machine does not have an instruction for computing the absolute value, this computation can usually be 
done in three or four branch-free instructions. First, compute y ← x >>s 31 (where >>s denotes shift right 
signed), and then one of the following: 


abs                 nabs 
(x ⊕ y) - y         y - (x ⊕ y) 
(x + y) ⊕ y         (y - x) ⊕ y 
x - (2x & y)        (2x & y) - x 


By "2x" we mean, of course, x + x or x << 1. 
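A minimal sketch of the first formula in each column, assuming 32-bit two's-complement words. Because right-shifting a negative signed value is implementation-defined in C, the mask y is built from an unsigned shift here; that substitution, and the names abs32 and nabs32, are assumptions of this sketch rather than part of the formulas above. Results are returned as 32-bit patterns.

```c
#include <stdint.h>

/* Branch-free absolute value. For x = -2**31 the result is the pattern 0x80000000. */
uint32_t abs32(int32_t x) {
    uint32_t ux = (uint32_t)x;
    uint32_t y = 0u - (ux >> 31);     /* all 1's if x < 0, else 0 (plays the role of x >>s 31) */
    return (ux ^ y) - y;
}

/* Negative of the absolute value; never overflows. */
uint32_t nabs32(int32_t x) {
    uint32_t ux = (uint32_t)x;
    uint32_t y = 0u - (ux >> 31);
    return y - (ux ^ y);
}
```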


If you have a fast multiply by a variable whose value is ±1, the following will do: 


((x >>s 30) | 1) * x 


2-5 Sign Extension 


By "sign extension," we mean to consider a certain bit position in a word to be the sign bit, and we wish to 
propagate that to the left, ignoring any other bits present. The standard way to do this is with shift left logical 
followed by shift right signed. However, if these instructions are slow or nonexistent on your machine, it may 
be done with one of the following, where we illustrate by propagating bit position 7 to the left: 


((x + 0x00000080) & 0x000000FF) - 0x00000080 
((x & 0x000000FF) ⊕ 0x00000080) - 0x00000080 


The "+" above can also be "-" or "⊕." The second formula is particularly useful if you know that the unwanted 
high-order bits are all 0's, because then the and can be omitted. 
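Both formulas can be sketched in C for the bit-7 case, assuming 32-bit words. The function names are hypothetical, and the results are returned as unsigned bit patterns to sidestep C's rules for converting out-of-range values to signed types.

```c
#include <stdint.h>

/* First formula: ((x + 0x80) & 0xFF) - 0x80. */
uint32_t sign_extend8_add(uint32_t x) {
    return ((x + 0x00000080u) & 0x000000FFu) - 0x00000080u;
}

/* Second formula: ((x & 0xFF) ^ 0x80) - 0x80. */
uint32_t sign_extend8_xor(uint32_t x) {
    return ((x & 0x000000FFu) ^ 0x00000080u) - 0x00000080u;
}
```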


2-6 Shift Right Signed from Unsigned 


If your machine does not have the shift right signed instruction, it may be computed using the formulas shown 
below. The first formula is from [GM], and the second is based on the same idea. Assuming the machine has 
mod 64 shifts, the first four formulas hold for 0 ≤ n ≤ 31, and the last holds for 0 ≤ n ≤ 63. The last formula 
holds for any n if by "holds" we mean "treats the shift amount to the same modulus as does the logical shift." 


When n is a variable, each formula requires five or six instructions on a basic RISC. 


((x + 0x80000000) >>u n) - (0x80000000 >>u n) 
t ← 0x80000000 >>u n; ((x >>u n) ⊕ t) - t 
t ← (x & 0x80000000) >>u n; (x >>u n) - (t + t) 
(x >>u n) | (-(x >>u 31) << (31 - n)) 
t ← -(x >>u 31); ((x ⊕ t) >>u n) ⊕ t 


In the first two formulas, an alternative for the expression 0x80000000 >>u n is 1 << (31 - n). 


If n is a constant, the first two formulas require only three instructions on many machines. If n = 31, the 
function can be done in two instructions with -(x >>u 31). 
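The second formula translates directly into C, where a right shift of an unsigned value is always a logical shift. This is a sketch for 0 <= n <= 31 only; the function name is hypothetical.

```c
#include <stdint.h>

/* Shift right signed built from unsigned shifts:
   t <- 0x80000000 >>u n;  ((x >>u n) ^ t) - t.   Valid for 0 <= n <= 31. */
uint32_t shift_right_signed(uint32_t x, unsigned n) {
    uint32_t t = 0x80000000u >> n;    /* where the original sign bit lands */
    return ((x >> n) ^ t) - t;        /* subtracting t propagates the sign leftward */
}
```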


2-7 Sign Function 


The sign, or signum, function is defined by 


           -1,  x < 0, 
sign(x) =   0,  x = 0, 
            1,  x > 0. 


It may be calculated with four instructions on most machines [Hop]: 


(x >>s 31) | (-x >>u 31) 


If you don't have shift right signed, then use the substitute noted at the end of Section 2-6, giving the following 
nicely symmetric formula (five instructions): 


-(x >>u 31) | (-x >>u 31) 


Comparison predicate instructions permit a three-instruction solution, with either 


Equation 3 


(x > 0) - (x < 0), or 
(x ≥ 0) - (x ≤ 0). 


Finally, we note that the formula (-x >>u 31) - (x >>u 31) almost works; it fails only for x = -2**31. 
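Two of these forms can be sketched in C, assuming 32-bit two's-complement words. The names sign32 and sign32_shifts are hypothetical; the shift version works on unsigned values and returns a bit pattern, so -1 appears as 0xFFFFFFFF.

```c
#include <stdint.h>

/* sign(x) from comparison predicates: (x > 0) - (x < 0). */
int sign32(int32_t x) {
    return (x > 0) - (x < 0);
}

/* sign(x) from unsigned shifts only: -(x >>u 31) | (-x >>u 31). */
uint32_t sign32_shifts(int32_t x) {
    uint32_t ux = (uint32_t)x;
    return (0u - (ux >> 31)) | ((0u - ux) >> 31);
}
```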


2-8 Three-Valued Compare Function 


The three-valued compare function, a slight generalization of the sign function, is defined by 


             -1,  x < y, 
cmp(x, y) =   0,  x = y, 
              1,  x > y. 


There are both signed and unsigned versions, and unless otherwise specified, this section applies to both. 


Comparison predicate instructions permit a three-instruction solution, an obvious generalization of Equations 
(3): 


(x > y) - (x < y), or 
(x ≥ y) - (x ≤ y). 
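In C, whose comparison operators already yield 1 or 0, the first form is a one-liner for both the signed and the unsigned version. The function names are hypothetical.

```c
#include <stdint.h>

/* Three-valued compare, signed operands: -1, 0, or 1. */
int cmp_signed(int32_t x, int32_t y)     { return (x > y) - (x < y); }

/* Three-valued compare, unsigned operands: -1, 0, or 1. */
int cmp_unsigned(uint32_t x, uint32_t y) { return (x > y) - (x < y); }
```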


A solution for unsigned integers on PowerPC is shown below [CWG]. On this machine, "carry" is "not borrow." 


subf  R5,Ry,Rx   # R5 <-- Rx - Ry. 
subfc R6,Rx,Ry   # R6 <-- Ry - Rx, set carry. 
subfe R7,Ry,Rx   # R7 <-- Rx - Ry + carry, set carry. 
subfe R8,R7,R5   # R8 <-- R5 - R7 + carry, (set carry). 


If limited to the instructions of the basic RISC, there does not seem to be any particularly good way to compute 
this function. The comparison predicates x < y, x ≤ y, and so on, require about five instructions (see Section 
2-11), leading to a solution in about 12 instructions (using a small amount of commonality in computing x < y 
and x > y). On the basic RISC it's probably preferable to use compares and branches (six instructions executed 
worst case if compares can be commoned). 


2-9 Transfer of Sign 


The transfer of sign function, called ISIGN in Fortran, is defined by 


                 abs(x),  y ≥ 0, 
ISIGN(x, y) = 
                -abs(x),  y < 0. 


This function can be calculated (modulo 2**32) with four instructions on most machines: 


t ← y >>s 31;                          t ← (x ⊕ y) >>s 31; 
ISIGN(x, y) = (abs(x) ⊕ t) - t         ISIGN(x, y) = (x ⊕ t) - t 
            = (abs(x) + t) ⊕ t                     = (x + t) ⊕ t 
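The right-hand form, which avoids computing abs(x), might look as follows in C. This is a sketch assuming 32-bit two's-complement words; the mask t is built with an unsigned shift and a negation in place of the signed shift, and the name isign is hypothetical. The result is returned as a bit pattern, computed modulo 2**32.

```c
#include <stdint.h>

/* Transfer of sign: |x| with the sign of y.
   t is all 1's when x and y have opposite signs, else 0; (x ^ t) - t then
   negates x exactly in the opposite-sign case. */
uint32_t isign(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x;
    uint32_t t = 0u - ((ux ^ (uint32_t)y) >> 31);
    return (ux ^ t) - t;
}
```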


2-10 Decoding a "Zero Means 2**n" Field 


Sometimes a 0 or negative value does not make much sense for a quantity, so it is encoded in an n-bit field with 
a 0 value being understood to mean 2**n, and a non-zero value having its normal binary interpretation. An 
example is the length field of PowerPC's load string word immediate (lswi) instruction, which occupies five 
bits. It is not useful to have an instruction that loads zero bytes, when the length is an immediate quantity, but it 
is definitely useful to be able to load 32 bytes. The length field could be encoded with values from 0 to 31 
denoting lengths from 1 to 32, but the "zero means 32" convention results in simpler logic when the processor 
must also support a corresponding instruction with a variable (in-register) length that employs straight binary 
encoding (e.g., PowerPC's lswx instruction). 


It is trivial to encode an integer in the range 1 to 2**n into the "zero means 2**n" encoding: simply mask the 
integer with 2**n - 1. To do the decoding without a test-and-branch is not quite as simple, but below are some 
possibilities (no doubt overdone), illustrated for a 3-bit field. They all require three instructions, not counting 
possible loads of constants. 


((x - 1) & 7) + 1       ((x + 7) | -8) + 9      8 - (-x & 7) 
((x + 7) & 7) + 1       ((x + 7) | 8) - 7       -(-x | -8) 
((x - 1) | -8) + 9      ((x - 1) & 8) + x 
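Three of the decodings, together with the masking encoder, can be sketched in C for the 3-bit case. The function names are hypothetical; x is assumed to be a field value in the range 0 to 7.

```c
/* Decode a 3-bit "zero means 8" field: 0 -> 8, and 1..7 -> themselves. */
unsigned decode3_a(unsigned x) { return ((x - 1) & 7) + 1; }
unsigned decode3_b(unsigned x) { return 8 - (-x & 7); }
unsigned decode3_c(unsigned x) { return ((x - 1) & 8) + x; }

/* Encode a value in the range 1..8 into the 3-bit field by masking. */
unsigned encode3(unsigned v) { return v & 7; }
```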


2-11 Comparison Predicates 


A "comparison predicate" is a function that compares two quantities, producing a single bit result of 1 if the 
comparison is true, and 0 if the comparison is false. Below we show branch-free expressions to evaluate the 
result into the sign position. To produce the 1/0 value used by some languages (e.g., C), follow the code with a 
shift right of 31. To produce the -1/0 result used by some other languages (e.g., Basic), follow the code with a 
shift right signed of 31. 


These formulas are, of course, not of interest on machines such as MIPS, the Compaq Alpha, and our model 
RISC, which have comparison instructions that compute many of these predicates directly, placing a 0/1-valued 
result in a general purpose register. 


A machine instruction that computes the negative of the absolute value is handy here. We show this function as 
"nabs." Unlike absolute value, it is well defined in that it never overflows. Machines that do not have "nabs" but 
have the more usual "abs" can use -abs(x) for nabs(x). If x is the maximum negative number, this overflows 
twice, but the result is correct. (We assume that the absolute value and the negation of the maximum negative 
number is itself.) Because some machines have neither "abs" nor "nabs," we give an alternative that does not 
use them. 


The "nlz" function is the number of leading zeros in its argument. The "doz" function (difference or zero) is 
described on page 37. 


x = y:   abs(x - y) - 1 
         abs(x - y + 0x80000000) 
         nlz(x - y) << 26 
         -(nlz(x - y) >>u 5) 
         ¬(x - y | y - x) 

x ≠ y:   nabs(x - y) 
         nlz(x - y) - 32 
         x - y | y - x 

x < y:   (x - y) ⊕ [(x ⊕ y) & ((x - y) ⊕ x)] 
         (x & ¬y) | ((x ≡ y) & (x - y)) 
         nabs(doz(y, x))                           [GSO] 

x ≤ y:   (x | ¬y) & ((x ⊕ y) | ¬(y - x)) 
         ((x ≡ y) >>s 1) + (x & ¬y)                [GSO] 

x <u y:  (¬x & y) | ((x ≡ y) & (x - y)) 
         (¬x & y) | ((¬x | y) & (x - y)) 

x ≤u y:  (¬x | y) & ((x ⊕ y) | ¬(y - x)) 


For x > y, x ≥ y, and so on, interchange x and y in the formulas for x < y, x ≤ y, and so on. The add of 
0x80000000 may be replaced with any instruction that inverts the high-order bit (in x, y, or x - y). 


Another class of formulas can be derived from the observation that the predicate x < y is given by the sign of 
x/2 - y/2, and the subtraction in that expression cannot overflow. The result can be fixed up by subtracting 1 in 
the cases in which the shifts discard essential information, as follows: 


x < y:   (x >>s 1) - (y >>s 1) - (¬x & y & 1) 
x <u y:  (x >>u 1) - (y >>u 1) - (¬x & y & 1) 


These execute in seven instructions on most machines (six if it has and not), which is no better than what we 
have above (five to seven instructions, depending upon the fullness of the set of logic instructions). 


The formulas above involving "nlz" are due to [Shep], and his formula for the x = y predicate is particularly 
useful because a minor variation of it gets the predicate evaluated to a 1/0-valued result with only three 
instructions: 


nlz(x - y) >>u 5. 
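A sketch of three of these predicates in C, assuming 32-bit words passed as unsigned bit patterns. The helper nlz is a plain loop written only for illustration, and the function names are hypothetical. Each predicate is followed by a shift right of 31 to produce the 1/0 value that C uses.

```c
#include <stdint.h>

/* Number of leading zeros; returns 32 for x == 0. Illustrative loop only. */
static uint32_t nlz(uint32_t x) {
    uint32_t n = 0;
    while (n < 32 && (x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* x == y as a 1/0 value: nlz(x - y) >>u 5. */
uint32_t eq_pred(uint32_t x, uint32_t y) { return nlz(x - y) >> 5; }

/* Signed x < y: sign bit of (x & ~y) | ((x equiv y) & (x - y)). */
uint32_t lt_signed(uint32_t x, uint32_t y) {
    return ((x & ~y) | (~(x ^ y) & (x - y))) >> 31;
}

/* Unsigned x < y: sign bit of (~x & y) | ((x equiv y) & (x - y)). */
uint32_t lt_unsigned(uint32_t x, uint32_t y) {
    return ((~x & y) | (~(x ^ y) & (x - y))) >> 31;
}
```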


Signed comparisons to 0 are frequent enough to deserve special mention. Below are some formulas for these, 
mostly derived directly from the above. Again, the result is in the sign position. 


x = 0:   abs(x) - 1 
         abs(x + 0x80000000) 
         nlz(x) << 26 
         -(nlz(x) >>u 5) 
         ¬(x | -x) 
         ¬x & (x - 1) 

x ≠ 0:   nabs(x) 
         nlz(x) - 32 
         x | -x 
         (x >>u 1) - x                             [CWG] 

x < 0:   x 

x ≤ 0:   x | (x - 1) 

x > 0:   x ⊕ nabs(x) 
         (x >>s 1) - x 
         -x & ¬x 

x ≥ 0:   ¬x 


Signed comparisons can be obtained from their unsigned counterparts by biasing the signed operands upwards 
by 2**31 and interpreting the results as unsigned integers. The reverse transformation also works. Thus we have 


x < y = x + 2**31 <u y + 2**31, 
x <u y = x - 2**31 < y - 2**31. 


Similar relations hold for ≤, ≤u, and so on. Addition and subtraction of 2**31 are equivalent, as they amount to 
inverting the sign bit. 


Another way to get signed comparisons from unsigned is based on the fact that if x and y have the same sign, 
then x < y = x <u y, whereas if they have opposite signs, then x < y = x >u y [Lamp]. Again, the 
reverse transformation also works, so we have 


x < y = (x <u y) ⊕ x_31 ⊕ y_31, and 
x <u y = (x < y) ⊕ x_31 ⊕ y_31, 


where x_31 and y_31 are the sign bits of x and y, respectively. Similar relations hold for ≤, ≤u, and so on. 


Using either of these devices enables computing all the usual comparison predicates other than = and ≠ in 
terms of any one of them, with at most three additional instructions on most machines. For example, let us take 
x ≤u y as primitive, because it is one of the simplest to implement (it is the carry bit from y - x). Then the other 
predicates can be obtained as follows: 
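One way these derivations might be written out, using the biasing device and taking unsigned "less than or equal" as the only primitive, is sketched below in C. Operands are 32-bit patterns; the names leu, bias, and the predicate functions are hypothetical, and this is an illustration of the idea rather than a definitive list.

```c
#include <stdint.h>

/* The primitive: unsigned x <= y (the carry bit from y - x). */
static int leu(uint32_t x, uint32_t y) { return x <= y; }

/* Bias by 2**31, that is, invert the sign bit. */
static uint32_t bias(uint32_t x) { return x ^ 0x80000000u; }

int le_s(uint32_t x, uint32_t y) { return  leu(bias(x), bias(y)); }  /* signed x <= y   */
int lt_s(uint32_t x, uint32_t y) { return !leu(bias(y), bias(x)); }  /* signed x <  y   */
int gt_s(uint32_t x, uint32_t y) { return !leu(bias(x), bias(y)); }  /* signed x >  y   */
int ge_s(uint32_t x, uint32_t y) { return  leu(bias(y), bias(x)); }  /* signed x >= y   */
int ltu(uint32_t x, uint32_t y)  { return !leu(y, x); }              /* unsigned x <  y */
int gtu(uint32_t x, uint32_t y)  { return !leu(x, y); }              /* unsigned x >  y */
int geu(uint32_t x, uint32_t y)  { return  leu(y, x); }              /* unsigned x >= y */
```

Each derived predicate costs at most two sign-bit inversions and one complement beyond the primitive, in line with the "at most three additional instructions" noted above.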


Comparison Predicates from the Carry Bit 


If the machine can easily deliver the carry bit into a general purpose register, this may permit concise code for 
some of the comparison predicates. Below are listed several of these relations. The notation carry(expression) 
means the carry bit generated by the outermost operation in expression. We assume the carry bit for the 
subtraction x - y is what comes out of the adder for x + ¬y + 1, which is the complement of "borrow." 


x = y:   carry(0 - (x - y)), or carry((x + ¬y) + 1), or 
         carry((x - y - 1) + 1) 

x ≠ y:   carry((x - y) - 1), i.e., carry((x - y) + (-1)) 

x < y:   ¬carry((x + 2**31) - (y + 2**31)) 

x ≤ y:   carry((y + 2**31) - (x + 2**31)) 

x <u y:  ¬carry(x - y) 

x ≤u y:  carry(y - x) 

x = 0:   carry(0 - x), or carry(¬x + 1) 

x ≠ 0:   carry(x - 1), i.e., carry(x + (-1)) 

x < 0:   carry(x + x) 

x ≤ 0:   carry(2**31 - (x + 2**31)) 


For x > y, use the complement of the expression for x ≤ y, and similarly for other relations involving "greater 
than." 


The GNU Superoptimizer has been applied to the problem of computing predicate expressions on the IBM 
RS/6000 computer and its close relative PowerPC [GK]. The RS/6000 has instructions for abs(x), nabs(x), doz 
(x, y), and a number of forms of add and subtract that use the carry bit. It was found that the RS/6000 can 
compute all the integer predicate expressions with three or fewer elementary (one-cycle) instructions, a result 
that surprised even the architects of the machine. "All" includes the six two-operand signed comparisons and 
the four two-operand unsigned comparisons, all of these with the second operand being 0, and all in forms that 
produce a 1/0 result or a -1/0 result. PowerPC, which lacks abs(x), nabs(x), and doz(x, y), can compute all the 
predicate expressions in four or fewer elementary instructions. 


How the Computer Sets the Comparison Predicates 


Most computers have a way of evaluating the integer comparison predicates to a 1-bit result. The result bit may 
be placed in a "condition register" or, for some machines (such as our RISC model), in a general purpose 
register. In either case, the facility is often implemented by subtracting the comparison operands and then 
performing a small amount of logic on the result bits to determine the 1-bit comparison result. 


Below is the logic for these operations. It is assumed that the machine computes x - y as x + ¬y + 1, and the 
following quantities are available in the result: 


C_o, the carry out of the high-order position 
C_i, the carry into the high-order position 
N, the sign bit of the result 
Z, which equals 1 if the result, exclusive of C_o, is all-0, and is otherwise 0 


Then we have the following in Boolean algebra notation (juxtaposition denotes and, + denotes or): 


V:       C_i ⊕ C_o (signed overflow) 
x = y:   Z 
x ≠ y:   ¬Z 
x < y:   N ⊕ V 
x ≤ y:   (N ⊕ V) + Z 
x > y:   (N ≡ V)¬Z 
x ≥ y:   N ≡ V 
x <u y:  ¬C_o 
x ≤u y:  ¬C_o + Z 
x >u y:  C_o ¬Z 
x ≥u y:  C_o 


2-12 Overflow Detection 


"Overflow" means that the result of an arithmetic operation is too large or too small to be correctly represented 
in the target register. This section discusses methods that a programmer might use to detect when overflow has 
occurred, without using the machine's "status bits" that are often supplied expressly for this purpose. This is 
important because some machines do not have such status bits (e.g., MIPS), and because even if the machine is 
so equipped, it is often difficult or impossible to access the bits from a high-level language. 


Signed Add/Subtract 


When overflow occurs on integer addition and subtraction, contemporary machines invariably discard the high- 
order bit of the result and store the low-order bits that the adder naturally produces. Signed integer overflow of 
addition occurs if and only if the operands have the same sign and the sum has sign opposite to that of the 
operands. Surprisingly, this same rule applies even if there is a carry into the adder—that is, if the calculation is 
x + y+ 1. This is important for the application of adding multiword signed integers, in which the last addition is 
a signed addition of two fullwords and a carry-in that may be 0 or +1. 


To prove the rule for addition, let x and y denote the values of the one-word signed integers being added, let c 
(carry-in) be 0 or 1, and assume for simplicity a 4-bit machine. Then if the signs of x and y are different, 


-8 ≤ x ≤ -1, and 
 0 ≤ y ≤ 7, 


or similar bounds apply if x is nonnegative and y is negative. In either case, by adding these inequalities and 
optionally adding in 1 for c, 


-8 ≤ x + y + c ≤ 7. 


This is representable as a 4-bit signed integer, and thus overflow does not occur when the operands have 
opposite signs. 


Now suppose x and y have the same sign. There are two cases: 


(a)                        (b) 
-8 ≤ x ≤ -1                0 ≤ x ≤ 7 
-8 ≤ y ≤ -1                0 ≤ y ≤ 7 


Thus, 


(a)                        (b) 
-16 ≤ x + y + c ≤ -1       0 ≤ x + y + c ≤ 15. 


Overflow occurs if the sum is not representable as a 4-bit signed integer, that is, if 


(a)                        (b) 
-16 ≤ x + y + c ≤ -9       8 ≤ x + y + c ≤ 15. 


In case (a), this is equivalent to the high-order bit of the 4-bit sum being 0, which is opposite to the sign of x 
and y. In case (b), this is equivalent to the high-order bit of the 4-bit sum being 1, which again is opposite to the 
sign of x and y. 


For subtraction of multiword integers, the computation of interest is x - y - c where again c is 0 or 1, with a 
value of 1 representing a borrow-in. From an analysis similar to the above, it can be seen that overflow in the 
final value of x - y - c occurs if and only if x and y have opposite signs and the sign of x - y - c is opposite to 
that of x (or, equivalently, the same as that of y). 


This leads to the following expressions for the overflow predicate, with the result being in the sign position. 
Following these with a shift right or shift right signed of 31 produces a 1/0- or a -1/0-valued result. 


x + y + c                                  x - y - c 
(x ≡ y) & ((x + y + c) ⊕ x)                (x ⊕ y) & ((x - y - c) ⊕ x) 
((x + y + c) ⊕ x) & ((x + y + c) ⊕ y)      ((x - y - c) ⊕ x) & ((x - y - c) ≡ y) 


By choosing the second alternative in the first column, and the first alternative in the second column (avoiding 
the equivalence operation), our basic RISC can evaluate these tests with three instructions in addition to those 
required to compute x + y + c or x - y - c. A fourth instruction (branch if negative) may be added to branch to 
code where the overflow condition is handled. 
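These two tests might be written in C as follows. Because signed overflow is undefined behavior in C, the sketch forms the sum and difference in unsigned arithmetic, which wraps modulo 2**32 and gives the same bit patterns; that choice and the function names are assumptions of the sketch.

```c
#include <stdint.h>

/* 1 if the signed sum x + y + c overflows (c is 0 or 1), else 0.
   Overflow occurred iff the sum's sign differs from the signs of both operands. */
int add_overflows(int32_t x, int32_t y, unsigned c) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t s = ux + uy + c;
    return (int)(((s ^ ux) & (s ^ uy)) >> 31);
}

/* 1 if the signed difference x - y - c overflows (c is 0 or 1), else 0.
   Overflow occurred iff x and y differ in sign and the result's sign differs from x's. */
int sub_overflows(int32_t x, int32_t y, unsigned c) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t d = ux - uy - c;
    return (int)(((ux ^ uy) & (d ^ ux)) >> 31);
}
```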


If executing with overflow interrupts enabled, the programmer may wish to test to see if a certain addition or 
subtraction will cause overflow, in a way that does not cause it. One branch-free way to do this is as follows: 


x + y + c                                  x - y - c 
z ← (x ≡ y) & 0x80000000                   z ← (x ⊕ y) & 0x80000000 
(x ≡ y) & (((x ⊕ z) + y + c) ≡ y)          (x ⊕ y) & (((x ⊕ z) - y - c) ⊕ y) 


The assignment to z in the left column sets z = 0x80000000 if x and y have the same sign, and sets z = 0 if they 
differ. Then, the addition in the second expression is done with x ⊕ z and y having different signs, so it can't 
overflow. If x and y are nonnegative, the sign bit in the second expression will be 1 if and only if 
(x - 2**31) + y + c ≥ 0, that is, iff x + y + c ≥ 2**31, which is the condition for overflow in evaluating 
x + y + c. If x and y are negative, the sign bit in the second expression will be 1 iff (x + 2**31) + y + c < 0, 
that is, iff x + y + c < -2**31, which again is the condition for overflow. The term x ≡ y ensures the correct 
result (0 in the sign position) if x and y have opposite signs. Similar remarks apply to the case of subtraction 
(right column). The code executes in nine instructions on the basic RISC. 


It might seem that if the carry from addition is readily available, this might help in computing the signed 
overflow predicate. This does not seem to be the case. However, one method along these lines is as follows. 


If x is a signed integer, then x + 2**31 is correctly represented as an unsigned number, and is obtained by 
inverting the high-order bit of x. Signed overflow in the positive direction occurs if x + y ≥ 2**31, that is, if 
(x + 2**31) + (y + 2**31) ≥ 3 · 2**31. This latter condition is characterized by carry occurring in the unsigned 
add (which means that the sum is greater than or equal to 2**32) and the high-order bit of the sum being 1. 
Similarly, overflow in the negative direction occurs if the carry is 0 and the high-order bit of the sum is also 0. 


This gives the following algorithm for detecting overflow for signed addition: 


Compute (x ⊕ 2**31) + (y ⊕ 2**31), giving sum s and carry c. 
Overflow occurred iff c equals the high-order bit of s. 


The sum is the correct sum for the signed addition, because inverting the high-order bits of both operands does 
not change their sum. 


For subtraction, the algorithm is the same except that in the first step a subtraction replaces the addition. We 
assume that the carry is that generated by computing x - y as x + ¬y + 1. The subtraction is the correct 
difference for the signed subtraction. 


These formulas are perhaps interesting, but on most machines they would not be quite as efficient as the 
formulas that do not even use the carry bit (e.g., overflow = (x ≡ y) & (s ⊕ x) for addition, and 
(x ⊕ y) & (d ⊕ x) for subtraction, where s and d are the sum and difference, respectively, of x and y). 


How the Computer Sets Overflow for Signed Add/Subtract 


Machines often set "overflow" for signed addition by means of the logic "the carry into the sign position is not 
equal to the carry out of the sign position." Curiously, this logic gives the correct overflow indication for both 
addition and subtraction, assuming the subtraction x - y is done by x + ¬y + 1. Furthermore, it is correct whether 
or not there is a carry- or borrow-in. This does not seem to lead to any particularly good methods for computing 
the signed overflow predicate in software, however, even though it is easy to compute the carry into the sign 
position. For addition and subtraction, the carry/borrow into the sign position is given by the sign bit after 
evaluating the following expressions (where c is 0 or 1): 


carry                       borrow 
(x + y + c) ⊕ x ⊕ y         (x - y - c) ⊕ x ⊕ y 


In fact, these expressions give, at each position i, the carry/borrow into position i. 


Unsigned Add/Subtract 


The following branch-free code may be used to compute the overflow predicate for unsigned add/subtract, with 
the result being in the sign position. The expressions involving a right shift are probably useful only when it is 
known that c = 0. The expressions in brackets compute the carry or borrow generated from the least significant 
position. 


x + y + c, unsigned 
(x & y) | ((x | y) & ¬(x + y + c)) 
(x >>u 1) + (y >>u 1) + [((x & y) | ((x | y) & c)) & 1] 


x - y - c, unsigned 
(¬x & y) | ((x ≡ y) & (x - y - c)) 
(¬x & y) | ((¬x | y) & (x - y - c)) 
(x >>u 1) - (y >>u 1) - [((¬x & y) | ((¬x | y) & c)) & 1] 


For unsigned add's and subtract's, there are much simpler formulas in terms of comparisons [MIPS]. For 
unsigned addition, overflow (carry) occurs if the sum is less (by unsigned comparison) than either of the 
operands. This and similar formulas are given below. Unfortunately, there is no way in these formulas to allow 
for a variable c that represents the carry- or borrow-in. Instead, the program must test c, and use a different type 
of comparison depending upon whether c is 0 or 1. 


x + y, unsigned     x + y + 1, unsigned     x - y, unsigned     x - y - 1, unsigned 
¬x <u y             ¬x ≤u y                 x <u y              x ≤u y 
x + y <u x          x + y + 1 ≤u x          x - y >u x          x - y - 1 ≥u x 


The first formula for each case above is evaluated before the add/subtract that may overflow, and it provides a 
way to do the test without causing overflow. The second formula for each case is evaluated after the add/ 
subtract that may overflow. 
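The comparison-based tests map directly onto C's unsigned types. The sketch below assumes a 32-bit unsigned int, so that no integer promotion widens the operands; the function names are hypothetical.

```c
#include <stdint.h>

/* Tests made before the operation. */
int uadd_will_carry(uint32_t x, uint32_t y)   { return ~x <  y; }   /* x + y     */
int uadd1_will_carry(uint32_t x, uint32_t y)  { return ~x <= y; }   /* x + y + 1 */
int usub_will_borrow(uint32_t x, uint32_t y)  { return  x <  y; }   /* x - y     */
int usub1_will_borrow(uint32_t x, uint32_t y) { return  x <= y; }   /* x - y - 1 */

/* Tests made after the operation, on the wrapped result. */
int uadd_carried(uint32_t x, uint32_t y)  { return x + y < x; }
int usub_borrowed(uint32_t x, uint32_t y) { return x - y > x; }
```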


There does not seem to be a similar simple device (using comparisons) for computing the signed overflow 
predicate. 


Multiplication 


For multiplication, overflow means that the result cannot be expressed in 32 bits (it can always be expressed in 
64 bits, whether signed or unsigned). Checking for overflow is simple if you have access to the high-order 32 
bits of the product. Let us denote the two halves of the 64-bit product by hi(x × y) and lo(x × y). Then the 
overflow predicates can be computed as follows [MIPS]: 


x × y, unsigned         x × y, signed 
hi(x × y) ≠ 0           hi(x × y) ≠ (lo(x × y) >>s 31) 
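Where a 64-bit integer type is available, both predicates can be sketched in C by forming the full product and splitting it. The function names are hypothetical, and the sign extension of the low half is built from an unsigned shift and a negation in place of the signed shift.

```c
#include <stdint.h>

/* Unsigned: overflow iff the high-order 32 bits of the 64-bit product are nonzero. */
int umul_overflows(uint32_t x, uint32_t y) {
    uint64_t p = (uint64_t)x * y;
    return (uint32_t)(p >> 32) != 0;
}

/* Signed: overflow iff the high half differs from the sign extension of the low half. */
int smul_overflows(int32_t x, int32_t y) {
    int64_t p = (int64_t)x * y;
    uint32_t hi = (uint32_t)((uint64_t)p >> 32);
    uint32_t lo = (uint32_t)(uint64_t)p;
    uint32_t lo_sign = 0u - (lo >> 31);     /* all 1's or all 0's, like lo >>s 31 */
    return hi != lo_sign;
}
```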


One way to check for overflow of multiplication is to do the multiplication and then check the result by 
dividing. But care must be taken not to divide by 0, and there is a further complication for signed 
multiplication. Overflow occurs if the following expressions are true: 


Unsigned                        Signed 
z ← x * y                       z ← x * y 
y ≠ 0 & z ÷u y ≠ x              (y < 0 & x = -2**31) | (y ≠ 0 & z ÷ y ≠ x) 


The complication arises when x = -2**31 and y = -1. In this case the multiplication overflows, but the machine 
may very well give a result of -2**31. This causes the division to overflow, and thus any result is possible (for 
some machines). Therefore, this case has to be checked separately, which is done by the term y < 0 & x = -2**31. 
The above expressions use the "conditional and" operator to prevent dividing by 0 (in C, use the && operator). 


It is also possible to use division to check for overflow of multiplication without doing the multiplication (that 
is, without causing overflow). For unsigned integers, the product overflows iff xy > 2^32 - 1, or x > (2^32 - 1)/y, 
or, since x is an integer, 


x > ⌊(2^32 - 1)/y⌋. 


Expressed in computer arithmetic, this is 


y ≠ 0 & x >u (0xFFFFFFFF ÷u y). 


For signed integers, the determination of overflow of x * y is not so simple. If x and y have the same sign, then 
overflow occurs iff xy > 2^31 - 1. If they have opposite signs, then overflow occurs iff xy < -2^31. These 
conditions may be tested as indicated in Table 2-2, which employs signed division. 


Table 2-2. Overflow Test for Signed Multiplication 


           y > 0                     y ≤ 0 
x > 0      x > 0x7FFFFFFF ÷ y        y < 0x80000000 ÷ x 
x ≤ 0      x < 0x80000000 ÷ y        x ≠ 0 & y < 0x7FFFFFFF ÷ x 


This test is awkward to implement because of the four cases. It is difficult to unify the expressions very much 
because of problems with overflow and with not being able to represent the number +2^31. 


The test can be simplified if unsigned division is available. We can use the absolute values of x and y, which 
are correctly represented under unsigned integer interpretation. The complete test can then be computed as 
shown below. The variable c = 2^31 - 1 if x and y have the same sign, and c = 2^31 otherwise. 


c ← ((x ≡ y) >>s 31) + 2^31 
x ← abs(x) 
y ← abs(y) 
y ≠ 0 & x >u (c ÷u y) 
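
A minimal C sketch of this absolute-value test follows. The function name is hypothetical, and the constant c is selected with a conditional rather than the signed shift, since C does not guarantee an arithmetic right shift. The absolute values are formed in unsigned arithmetic so that |-2^31| is represented exactly.

```c
#include <stdint.h>
#include <stdbool.h>

/* Tests whether x*y would overflow 32-bit signed arithmetic,
   without performing the multiplication. */
bool smul_overflows_div(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    /* c = 2^31 - 1 when the signs agree, 2^31 when they differ. */
    uint32_t c = ((ux ^ uy) >> 31) ? 0x80000000u : 0x7FFFFFFFu;
    uint32_t ax = (x < 0) ? 0u - ux : ux;    /* abs(x) as an unsigned value */
    uint32_t ay = (y < 0) ? 0u - uy : uy;    /* abs(y) as an unsigned value */
    return ay != 0 && ax > c / ay;           /* unsigned compare and divide */
}
```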


The number of leading zeros instruction may be used to give an estimate of whether or not x * y will overflow, 
and the estimate may be refined to give an accurate determination. First, consider the multiplication of unsigned 
numbers. It is easy to show that if x and y, as 32-bit quantities, have m and n leading 0's, respectively, then the 
64-bit product has either m + n or m + n + 1 leading 0's (or 64, if either x = 0 or y = 0). Overflow occurs if the 
64-bit product has fewer than 32 leading 0's. Hence, 


nlz(x) + nlz(y) ≥ 32: Multiplication definitely does not overflow. 


nlz(x) + nlz(y) ≤ 30: Multiplication definitely does overflow. 


For nlz(x) + nlz(y) = 31, overflow may or may not occur. In this case, the overflow assessment may be made by 
evaluating t = x⌊y/2⌋. This will not overflow. Since xy is 2t or, if y is odd, 2t + x, the product xy overflows 
if t ≥ 2^31. These considerations lead to a plan for computing xy but branching to "overflow" if the product 
overflows. This plan is shown in Figure 2-2. 


Figure 2-2 Determination of overflow of unsigned multiplication. 


unsigned x, y, z, m, n, t; 

m = nlz(x); 
n = nlz(y); 
if (m + n <= 30) goto overflow; 
t = x*(y >> 1); 
if ((int)t < 0) goto overflow; 
z = t*2; 
if (y & 1) { 
   z = z + x; 
   if (z < x) goto overflow; 
} 
// z is the correct product of x and y. 


For the multiplication of signed integers, we can make a partial determination of whether or not overflow 
occurs from the number of leading 0's of nonnegative arguments, and the number of leading 1's of negative 
arguments. Let 


m = nlz(x) + nlz(¬x), and 
n = nlz(y) + nlz(¬y). 


Then, we have 


m + n ≥ 34: Multiplication definitely does not overflow. 


m + n ≤ 31: Multiplication definitely does overflow. 


There are two ambiguous cases: 32 and 33. The case m + n = 33 overflows only when both arguments are 
negative and the true product is exactly 2^31 (machine result is -2^31), so it can be recognized by a test that the 
product has the correct sign (that is, overflow occurred if x ⊕ y ⊕ (x * y) < 0). When m + n = 32, the 
distinction is not so easily made. 


We will not dwell on this further, except to note that an overflow estimate for signed multiplication can also be 
made based on nlz(abs(x)) + nlz(abs(y)), but again there are two ambiguous cases (a sum of 31 or 32). 


Division 


For the signed division x ÷ y, overflow occurs if the following expression is true: 


y = 0 | (x = 0x80000000 & y = -1) 


Most machines signal overflow (or trap) for the indeterminate form 0 ÷ 0. 


Straightforward code for evaluating this expression, including a final branch to the overflow handling code, 
consists of seven instructions, three of which are branches. There do not seem to be any particularly good tricks 
to improve on this, but below are a few possibilities: 


[abs(y ⊕ 0x80000000) | (abs(x) & abs(y ≡ 0x80000000))] < 0 


That is, evaluate the large expression in brackets, and branch if the result is less than 0. This executes in about 
nine instructions, counting the load of the constant and the final branch, on a machine that has the indicated 
instructions and that gets the "compare to 0" for free. 


Some other possibilities are to first compute z from 


z ← (x ⊕ 0x80000000) | (y + 1) 


(three instructions on many machines), and then do the test and branch on y = 0 | z = 0 in one of the following 
ways: 


((y | -y) & (z | -z)) ≥ 0 
(nabs(y) & nabs(z)) ≥ 0 
((nlz(y) | nlz(z)) >>u 5) ≠ 0 


These execute in nine, seven, and eight instructions, respectively, on a machine that has the indicated 
instructions. The last line represents a good method for PowerPC. 
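
As an illustration, the straightforward predicate and the first of the branch-free variants built on z might be written in C as below. The function names are hypothetical, and the arithmetic is done on unsigned copies of the operands so that the negations and the addition y + 1 cannot themselves overflow.

```c
#include <stdint.h>
#include <stdbool.h>

/* Straightforward form: y = 0 | (x = 0x80000000 & y = -1). */
bool sdiv_overflows(int32_t x, int32_t y) {
    return y == 0 || (x == INT32_MIN && y == -1);
}

/* Branch-free form: z is 0 only when x = -2^31 and y = -1. */
bool sdiv_overflows_branchfree(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t z = (ux ^ 0x80000000u) | (uy + 1u);
    /* v | -v has its sign bit set iff v is nonzero. */
    uint32_t t = (uy | (0u - uy)) & (z | (0u - z));
    return (t >> 31) == 0;          /* sign bit clear iff y = 0 or z = 0 */
}
```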


For the unsigned division x ÷u y, overflow occurs if and only if y = 0. 


2-13 Condition Code Result of Add, Subtract, and Multiply 


Many machines provide a "condition code" that characterizes the result of integer arithmetic operations. Often 
there is only one add instruction, and the characterization reflects the result for both unsigned and signed 
interpretation of the operands and result (but not for mixed types). The characterization usually consists of the 
following: 


• Whether or not carry occurred (unsigned overflow) 
• Whether or not signed overflow occurred 
• Whether the 32-bit result, interpreted as a signed two's-complement integer and ignoring carry and 
  overflow, is negative, 0, or positive 


Some older machines give an indication of whether the infinite precision result (that is, 33-bit result for add's 
and subtract's) is positive, negative, or 0. However, this indication is not easily used by compilers of high-level 
languages, and so has fallen out of favor. 


For addition, only nine of the 12 combinations of these events are possible. The ones that cannot occur are "no 
carry, overflow, result > 0," "no carry, overflow, result = 0," and "carry, overflow, result < 0." Thus, four bits 
are, just barely, needed for the condition code. Two of the combinations are unique in the sense that only one 
value of inputs produces them: Adding 0 to itself is the only way to get "no carry, no overflow, result = 0," and 
adding the maximum negative number to itself is the only way to get "carry, overflow, result = 0." These 
remarks remain true if there is a "carry in," that is, if we are computing x + y + 1. 


For subtraction, let us assume that to compute x - y the machine actually computes x + ¬y + 1, with the carry 
produced as for an add (in this scheme the meaning of "carry" is reversed for subtraction, in that carry = 1 
signifies that the result fits in a single word, and carry = 0 signifies that the result does not fit in a single word). 
Then for subtraction only seven combinations of events are possible. The ones that cannot occur are the three 
that cannot occur for addition, plus "no carry, no overflow, result = 0," and "carry, overflow, result = 0." 


If a machine's multiplier can produce a doubleword result, then two multiply instructions are desirable: one for 
signed and one for unsigned operands. (On a 4-bit machine, in hexadecimal, F × F = 01 signed, and F × F = E1 
unsigned.) For these instructions, neither carry nor overflow can occur, in the sense that the result will always 
fit in a doubleword. 


For a multiplication instruction that produces a one-word result (the low-order word of the doubleword result), 
let us take "carry" to mean that the result does not fit in a word with the operands and result interpreted as 
unsigned integers, and let us take "overflow" to mean that the result does not fit in a word with the operands 
and result interpreted as signed two's-complement integers. Then again, there are nine possible combinations of 
results, with the missing ones being "no carry, overflow, result > 0," "no carry, overflow, result = 0," and 
"carry, no overflow, result = 0." Thus, considering addition, subtraction, and multiplication together, ten 
combinations can occur. 


2-14 Rotate Shifts 


These are rather trivial. Perhaps surprisingly, this code works for n ranging from 0 to 32 inclusive, even if the 
shifts are mod-32. 


Rotate left n:  y ← (x << n) | (x >>u (32 - n)) 


Rotate right n: y ← (x >>u n) | (x << (32 - n)) 
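
In C, a shift by 32 on a 32-bit operand is undefined, so a direct transcription of these formulas is not portable for n = 0 or n = 32. The sketch below substitutes explicit masking of the shift amounts, which reproduces what a machine with mod-32 shifts would do; the function names are hypothetical.

```c
#include <stdint.h>

/* Rotate left by n, for any n; shift amounts are reduced mod 32. */
uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << (n & 31)) | (x >> (-n & 31));
}

/* Rotate right by n, for any n. */
uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> (n & 31)) | (x << (-n & 31));
}
```

For n = 0 and n = 32 both masked shift amounts are 0, so the result is simply x, matching the behavior the text describes.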


2-15 Double-Length Add/Subtract 


Using one of the expressions shown on page 29 for overflow of unsigned addition and subtraction, we can 
easily implement double-length addition and subtraction without accessing the machine's carry bit. To illustrate 
with double-length addition, let the operands be (x1, x0) and (y1, y0), and the result be (z1, z0). Subscript 1 
denotes the most significant half, and subscript 0 the least significant. We assume that all 32 bits of the registers 
are used. The less significant words are unsigned quantities. 


z0 ← x0 + y0 
c ← [(x0 & y0) | ((x0 | y0) & ¬z0)] >>u 31 
z1 ← x1 + y1 + c 


This executes in nine instructions. The second line can be c ← (z0 <u x0), permitting a four-instruction 
solution on machines that have this comparison operator in a form that gives the result as a 1 or 0 in a register, 
such as the "SLTU" (Set on Less Than Unsigned) instruction on MIPS [MIPS]. 


Similar code for double-length subtraction (x - y) is 


z0 ← x0 - y0 
b ← [(¬x0 & y0) | ((x0 ≡ y0) & z0)] >>u 31 
z1 ← x1 - y1 - b 


This executes in eight instructions on a machine that has a full set of logical instructions. The second line can 
be b ← (x0 <u y0), permitting a four-instruction solution on machines that have the "SLTU" instruction. 
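
The comparison-based variants translate naturally to C, where an unsigned comparison yields 0 or 1. The following sketch uses hypothetical function names and returns the two result words packed into a uint64_t purely so the result is easy to inspect.

```c
#include <stdint.h>

/* (x1,x0) + (y1,y0); the carry out of the low word is z0 <u x0. */
uint64_t dadd(uint32_t x1, uint32_t x0, uint32_t y1, uint32_t y0) {
    uint32_t z0 = x0 + y0;
    uint32_t c  = z0 < x0;
    uint32_t z1 = x1 + y1 + c;
    return ((uint64_t)z1 << 32) | z0;
}

/* (x1,x0) - (y1,y0); the borrow out of the low word is x0 <u y0. */
uint64_t dsub(uint32_t x1, uint32_t x0, uint32_t y1, uint32_t y0) {
    uint32_t z0 = x0 - y0;
    uint32_t b  = x0 < y0;
    uint32_t z1 = x1 - y1 - b;
    return ((uint64_t)z1 << 32) | z0;
}
```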


Double-length addition and subtraction can be done in five instructions on most machines by representing the 
multiple-length data using only 31 bits of the least significant words, with the high-order bit being 0 except 
momentarily when it contains a carry or borrow bit. 


2-16 Double-Length Shifts 


Let (x1, x0) be a pair of 32-bit words to be shifted left or right as if they were a single 64-bit quantity, with x1 
being the most significant half. Let (y1, y0) be the result, interpreted similarly. Assume the shift amount n is a 
variable ranging from 0 to 63. Assume further that the machine's shift instructions are modulo 64 or greater. 
That is, a shift amount in the range 32 to 63 or -32 to -1 results in an all-0 word, unless the shift is a signed right 
shift, in which case the result is 32 sign bits from the word shifted. (This code will not work on the Intel x86 
machines, which have mod-32 shifts.) 


Under these assumptions the shift left double operation may be accomplished as follows (eight instructions): 


y1 ← x1 << n | x0 >>u (32 - n) | x0 << (n - 32) 
y0 ← x0 << n 


The main connective in the first assignment must be or, not plus, to give the correct result when n = 32. If it is 
known that 0 ≤ n ≤ 32, the last term of the first assignment may be omitted, giving a six-instruction solution. 


Similarly, a shift right double unsigned operation may be done with 


y0 ← x0 >>u n | x1 << (32 - n) | x1 >>u (n - 32) 
y1 ← x1 >>u n 


Shift right double signed is more difficult, because of an unwanted sign propagation in one of the terms. 
Straightforward code follows: 


if n < 32 then y0 ← x0 >>u n | x1 << (32 - n) 
else y0 ← x1 >>s (n - 32) 
y1 ← x1 >>s n 


If your machine has the conditional move instructions, it is a simple matter to express this in branch-free code, 
in which form it takes eight instructions. If the conditional move instructions are not available, the operation 
may be done in ten instructions by using the familiar device of constructing a mask with the shift right signed 
31 instruction to mask the unwanted sign propagating term: 


y0 ← x0 >>u n | x1 << (32 - n) | [(x1 >>s (n - 32)) & ((32 - n) >>s 31)] 
y1 ← x1 >>s n 


2-17 Multibyte Add, Subtract, Absolute Value 


Some applications deal with arrays of short integers (usually bytes or halfwords), and often execution is faster 
if they are operated on a word at a time. For definiteness, the examples here deal with the case of four 1-byte 
integers packed into a word, but the techniques are easily adapted to other packings, such as a word containing 
a 12-bit integer and two 10-bit integers, and so on. These techniques are of greater value on 64-bit machines, 
because more work is done in parallel. 


Addition must be done in a way that blocks the carries from one byte into another. This can be accomplished by 
the following two-step method: 


1. Mask out the high-order bit of each byte of each operand and add (there will then be no carries 
across byte boundaries). 


2. Fix up the high-order bit of each byte with a 1-bit add of the two operands and the carry into that 
bit. 


The carry into the high-order bit of each byte is of course given by the high-order bit of each byte of the sum 
computed in step 1. The subsequent similar method works for subtraction: 


Addition 
s ← (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F) 
s ← ((x ⊕ y) & 0x80808080) ⊕ s 


Subtraction 
d ← (x | 0x80808080) - (y & 0x7F7F7F7F) 
d ← ((x ⊕ y) | 0x7F7F7F7F) ≡ d 


These execute in eight instructions, counting the load of 0x7F7F7F7F, on a machine that has a full set of 
logical instructions. (Change the and and or of 0x80808080 to and not and or not, respectively, of 
0x7F7F7F7F.) 
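
The two-step method for four packed bytes might be sketched in C as follows. The function names are hypothetical, and the equivalence operation in the subtraction is written as the complement of exclusive or.

```c
#include <stdint.h>

/* Add four packed bytes, each modulo 256, with no carry between bytes. */
uint32_t add_bytes(uint32_t x, uint32_t y) {
    uint32_t s = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu); /* low 7 bits of each byte */
    return ((x ^ y) & 0x80808080u) ^ s;                 /* fix up the high bits */
}

/* Subtract four packed bytes, each modulo 256, with no borrow between bytes. */
uint32_t sub_bytes(uint32_t x, uint32_t y) {
    uint32_t d = (x | 0x80808080u) - (y & 0x7F7F7F7Fu);
    return ~(((x ^ y) | 0x7F7F7F7Fu) ^ d);
}
```

For example, adding 0xFF01807F and 0x01FF8001 bytewise gives 0x00000080: the first three bytes wrap to 0 independently and the last becomes 0x80.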


There is a different technique for the case in which the word is divided into only two fields. In this case, 
addition can be done by means of a 32-bit addition followed by subtracting out the unwanted carry. On page 28 
we noted that the expression (x + y) ⊕ x ⊕ y gives the carries into each position. Using this and similar 
observations about subtraction gives the following code for adding/subtracting two halfwords modulo 2^16 
(seven instructions): 


Addition                           Subtraction 
s ← x + y                          d ← x - y 
c ← (s ⊕ x ⊕ y) & 0x00010000       b ← (d ⊕ x ⊕ y) & 0x00010000 
s ← s - c                          d ← d + b 


Multibyte absolute value is easily done by complementing and adding 1 to each byte that contains a negative 
integer (that is, has its high-order bit on). The following code sets each byte of y equal to the absolute value of 
each byte of x (eight instructions): 


a ← x & 0x80808080     // Isolate signs. 
b ← a >>u 7            // Integer 1 where x is negative. 
m ← (a - b) | a        // 0xFF where x is negative. 
y ← (x ⊕ m) + b        // Complement and add 1 where negative. 


The third line could as well be m ← a + a - b. The addition of b in the fourth line cannot carry across byte 
boundaries, because the quantity x ⊕ m has a high-order 0 in each byte. 
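
A direct C transcription of the four steps, with a hypothetical function name, is shown here as a sketch. Note that a byte holding -128 (0x80) is left unchanged, just as with ordinary two's-complement absolute value.

```c
#include <stdint.h>

/* Absolute value of each of four signed bytes packed in x. */
uint32_t abs_bytes(uint32_t x) {
    uint32_t a = x & 0x80808080u;   /* isolate signs */
    uint32_t b = a >> 7;            /* integer 1 where the byte is negative */
    uint32_t m = (a - b) | a;       /* 0xFF where the byte is negative */
    return (x ^ m) + b;             /* complement and add 1 where negative */
}
```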


2-18 Doz, Max, Min 
The "doz" function is "difference or zero," defined as follows, for signed arguments: 


doz(x, y) = x - y,   if x ≥ y, 
            0,       if x < y. 


It has been called "first grade subtraction," because the result is 0 if you try to take away too much. We will use 
it to implement max(x, y) and min(x, y). In this connection it is important to note that doz(x, y) can be negative; 
it is negative if the subtraction overflows. The difference or zero function can be used directly to implement the 
Fortran IDIM function, although in Fortran, results are generally undefined if overflow occurs. 


There seems to be no very good way to implement doz(x, y), max(x, y), and min(x, y) in a branch-free way that 
is applicable to most computers. About the best we can do is to compute doz(x, y) using one of the expressions 
given on page 22 for the x < y predicate, and then compute max(x, y) and min(x, y) from it, as follows: 


d ← x - y 
doz(x, y) = d & [(d ≡ ((x ⊕ y) & (d ⊕ x))) >>s 31] 
max(x, y) = y + doz(x, y) 
min(x, y) = x - doz(x, y) 


This computes doz(x, y) in seven instructions if the machine has equivalence, or eight if not, and it computes 
max(x, y) or min(x, y) in one more instruction. 
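
The signed formulas can be sketched in C as below. This is an illustration under two assumptions: the arithmetic is carried out on unsigned copies so that wraparound is well defined, and the conversion back to int32_t behaves as on two's-complement machines. The mask is formed by subtracting 1 from the isolated sign bit rather than with a signed shift, and the function names are hypothetical.

```c
#include <stdint.h>

/* Difference or zero: x - y if x >= y (signed), else 0. */
int32_t doz(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t d  = ux - uy;
    uint32_t lt = d ^ ((ux ^ uy) & (d ^ ux));   /* sign bit set iff x < y */
    uint32_t mask = (lt >> 31) - 1u;            /* all 1's iff x >= y */
    return (int32_t)(d & mask);
}

int32_t max32(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)y + (uint32_t)doz(x, y));
}

int32_t min32(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x - (uint32_t)doz(x, y));
}
```

When x - y overflows, doz returns the wrapped (negative) difference, yet max32 and min32 still come out right because the wraparound cancels in the final add or subtract.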


The following are unsigned versions of these functions: 

d ← x - y 
dozu(x, y) = d & ¬[((¬x & y) | ((x ≡ y) & d)) >>s 31] 
maxu(x, y) = y + dozu(x, y) 
minu(x, y) = x - dozu(x, y) 


The IBM RISC/6000 computer, and its predecessor the 801, has doz(x, y) provided as a single instruction. It 
permits computing the max(x, y) and min(x, y) of signed integers in two instructions, and is occasionally useful 
in itself. Implementing max(x, y) and min(x, y) directly is more costly because the machine would then need 
paths from the output ports of the register file back to an input port, bypassing the ALU. 


Machines that have conditional move can get destructive [2] max(x, y) and min(x, y) in two instructions. For 
example, on our full RISC, x ← max(x, y) can be calculated as follows (we write the target register first): 


cmplt z,x,y      Set z = 1 if x < y, else 0. 
movne x,z,y      If z is nonzero, set x = y. 


[2] A destructive operation is one that overwrites one or more of its arguments. 

2-19 Exchanging Registers 


A very old trick is that of exchanging the contents of two registers without using a third [IBM]: 


x ← x ⊕ y 
y ← y ⊕ x 
x ← x ⊕ y 


This works well on a two-address machine. The trick also works if ⊕ is replaced by the ≡ logical operation 
(complement of exclusive or), and can be made to work in various ways with add's and subtract's: 


x ← x + y      x ← x - y      x ← y - x 
y ← x - y      y ← y + x      y ← y - x 
x ← x - y      x ← y - x      x ← x + y 


Unfortunately, each of these has an instruction that is unsuitable for a two-address machine, unless the machine 
has "reverse subtract." 


This little trick can actually be useful in the application of double buffering, in which two pointers are swapped. 
The first instruction can be factored out of the loop in which the swap is done (although this negates the 
advantage of saving a register): 

Outside the loop: t ← x ⊕ y 
Inside the loop:  x ← x ⊕ t 
                  y ← y ⊕ t 


Exchanging Corresponding Fields of Registers 


The problem here is to exchange the contents of two registers x and y wherever a mask bit m_i = 1, and to leave 
x and y unaltered wherever m_i = 0. By "corresponding" fields, we mean that no shifting is required. The 1-bits 
of m need not be contiguous. The straightforward method is as follows: 


x' ← (x & ¬m) | (y & m) 
y ← (y & ¬m) | (x & m) 
x ← x' 


By using "temporaries" for the four and expressions, this can be seen to require seven instructions, assuming 
that either m or ¬m can be loaded with a single instruction and the machine has and not as a single instruction. 
If the machine is capable of executing the four (independent) and expressions in parallel, the execution time is 
only three cycles. 


A method that is probably better (five instructions, but four cycles on a machine with unlimited instruction- 
level parallelism) is shown in column (a) below. It is suggested by the "three exclusive or" code for exchanging 
registers. 


(a)                  (b)                   (c) 
x ← x ⊕ y            x ← x ≡ y             t ← (x ⊕ y) & m 
y ← y ⊕ (x & m)      y ← y ≡ (x | ¬m)      x ← x ⊕ t 
x ← x ⊕ y            x ← x ≡ y             y ← y ⊕ t 


The steps in column (b) do the same exchange as that of column (a), but column (b) is useful if m does not fit in 
an immediate field but ¬m does, and the machine has the equivalence instruction. 


Still another method is shown in column (c) above [GLS1]. It also takes five instructions (again assuming one 
instruction must be used to load m into a register), but executes in only three cycles on a machine with 
sufficient instruction-level parallelism. 
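
The column (c) method, in which a single temporary holds the masked exclusive or of the two registers, can be sketched in C as follows. The function name and the use of pointers to stand in for the two registers are assumptions made for illustration.

```c
#include <stdint.h>

/* Exchange the bits of *x and *y wherever the mask m has a 1-bit. */
void swap_fields(uint32_t *x, uint32_t *y, uint32_t m) {
    uint32_t t = (*x ^ *y) & m;   /* differences, restricted to the mask */
    *x ^= t;
    *y ^= t;
}
```

With m equal to all 1's this exchanges the whole words, and with m = 0 it does nothing.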


Exchanging Two Fields of the Same Register 


Assume a register x has two fields (of the same length) that are to be swapped, without altering other bits in the 
register. That is, the object is to swap fields B and D, without altering fields A, C, and E, in the computer word 
illustrated below. The fields are separated by a shift distance k. 


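[Figure: a computer word divided into fields A, B, C, D, and E, with field B located k bit positions to the left of field D.] 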


Straightforward code would shift D and B to their new positions, and combine the words with and and or 
operations, as follows: 


t1 = (x & m) << k 
t2 = (x >>u k) & m 
x' = (x & m') | t1 | t2 


Here, m is a mask with 1's in field D (and 0's elsewhere), and m' is a mask with 1's in fields A, C, and E. This 
code requires nine instructions and four cycles on a machine with unlimited instruction-level parallelism, 
allowing for two instructions to load the two masks. 


A method that requires only seven instructions and executes in five cycles, under the same assumptions, is 
shown below [GLS1]. It is similar to the code in column (c) on page 39 for interchanging corresponding fields 
of two registers. Again, m is a mask that isolates field D. 


t1 = [x ⊕ (x >>u k)] & m 
t2 = t1 << k 
x' = x ⊕ t1 ⊕ t2 


The idea is that t1 contains B ⊕ D in position D (and 0's elsewhere), and t2 contains B ⊕ D in position B. This 
code, and the straightforward code given earlier, work correctly if B and D are "split fields," that is, if the 1-bits 
of mask m are not contiguous. 
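
The exclusive-or method for swapping two fields of one word can be sketched in C as below, following the description of t1 (B ⊕ D in position D) and t2 (B ⊕ D in position B). The function name is hypothetical, and the sketch assumes 0 < k < 32 and that m isolates the lower field D.

```c
#include <stdint.h>

/* Swap field D (selected by mask m) with the field k bits to its left. */
uint32_t swap_two_fields(uint32_t x, uint32_t m, unsigned k) {
    uint32_t t1 = (x ^ (x >> k)) & m;   /* B xor D, in position D */
    uint32_t t2 = t1 << k;              /* B xor D, in position B */
    return x ^ t1 ^ t2;
}
```

For instance, with x = 0x12345678, m = 0x000000F0, and k = 16, the nibbles 7 and 3 trade places, giving 0x12745638.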


Conditional Exchange 


The exchange methods of the preceding two sections, which are based on exclusive or, degenerate into 
no-operations if the mask m is 0. Hence, they can perform an exchange of entire registers, or of corresponding 
fields of two registers, or of two fields of the same register, if m is set to all 1's if some condition c is true, and 
to all 0's if c is false. This gives branch-free code if m can be set up without branching. 


2-20 Alternating among Two or More Values 


Suppose a variable x can have only two possible values a and b, and you wish to assign to x the value other than 
its current one, and you wish your code to be independent of the values of a and b. For example, in a compiler x 
might be an opcode that is known to be either branch true or branch false, and whichever it is, you want to 
switch it to the other. The values of the opcodes branch true and branch false are arbitrary, probably defined by 
a C #define or enum declaration in a header file. 


The straightforward code to do the switch is 


if (x == a) x = b; 
else x = a; 


or, as is often seen in C programs, 

x = x == a ? b : a; 

A far better (or at least more efficient) way to code it is either 


x ← a + b - x, or 
x ← a ⊕ b ⊕ x. 


If a and b are constants, these require only one or two basic RISC instructions. Of course, overflow in 
calculating a + b can be ignored. 
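
As a small illustration in C, with made-up opcode values standing in for whatever a real header file would define:

```c
/* Hypothetical opcode values; any two distinct constants work. */
enum { BRANCH_TRUE = 0x41, BRANCH_FALSE = 0x9C };

/* Each function maps BRANCH_TRUE to BRANCH_FALSE and vice versa. */
unsigned toggle_xor(unsigned x) { return (BRANCH_TRUE ^ BRANCH_FALSE) ^ x; }
unsigned toggle_add(unsigned x) { return (BRANCH_TRUE + BRANCH_FALSE) - x; }
```

Because the parenthesized part is a compile-time constant, each function reduces to a single exclusive or, or a single subtract from a constant.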


This raises the question: Is there some particularly efficient way to cycle among three or more values? That is, 
given three arbitrary but distinct constants a, b, and c, we seek an easy-to-evaluate function f that satisfies 


f(a) = b, 
f(b) = c, 
f(c) = a. 


It is perhaps interesting to note that there is always a polynomial for such a function. For the case of three 
constants, 


Equation 2 


f(x) = a (x - a)(x - b) / ((c - a)(c - b)) 
     + b (x - b)(x - c) / ((a - b)(a - c)) 
     + c (x - c)(x - a) / ((b - c)(b - a)). 


(The idea is that if x = a, the first and last terms vanish, and the middle term simplifies to b, and so on.) This 
requires 14 arithmetic operations to evaluate, and, for arbitrary a, b, and c, the intermediate results exceed the 
computer's word size. But it is just a quadratic; if written in the usual form for a polynomial and evaluated 
using Horner's rule [3], it would require only five arithmetic operations (four for a quadratic with integer 
coefficients, plus one for a final division). Rearranging Equation (2) accordingly gives 


f(x) = 1 / ((a - b)(a - c)(b - c)) × 
       { [(a - b)a + (b - c)b + (c - a)c] x^2 
       + [(a - b)b^2 + (b - c)c^2 + (c - a)a^2] x 
       + [(a - b)a^2 b + (b - c)b^2 c + (c - a)a c^2] }. 


[3] Horner's rule simply factors out x. For example, it evaluates the fourth-degree polynomial ax^4 + bx^3 + cx^2 + dx + e 
as x(x(x(ax + b) + c) + d) + e. For a polynomial of degree n it takes n multiplications and n additions, and it is very 
suitable for the multiply-add instruction. 


This is getting too complicated to be interesting (or practical). 


Another method, similar to Equation (2) in that just one of the three terms survives, is 


f(x) = ((-(x = c)) & a) + ((-(x = a)) & b) + ((-(x = b)) & c). 


This takes 11 instructions if the machine has the equal predicate, not counting loads of constants. Because the 
two addition operations are combining two 0 values with a nonzero, they can be replaced with or or exclusive 
or operations. 


The formula can be simplified by precalculating a - c and b - c, and then using [GLS1]: 


f(x) = ((-(x = c)) & (a - c)) + ((-(x = a)) & (b - c)) + c, or 
f(x) = ((-(x = c)) & (a ⊕ c)) ⊕ ((-(x = a)) & (b ⊕ c)) ⊕ c. 


Each of these operations takes eight instructions. But on most machines these are probably no better than the 
straightforward C code shown below, which executes in four to six instructions for small a, b, and c. 


if (x == a) x = b; 
else if (x == b) x = c; 
else x = a; 
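
For comparison, the branch-free exclusive-or formula built on the equal predicate can be written in C, where a comparison yields 0 or 1 and negating that in unsigned arithmetic gives a mask of all 0's or all 1's. The function name and the choice to pass a, b, and c as parameters (rather than fixing them as constants) are assumptions for illustration.

```c
#include <stdint.h>

/* Returns b if x == a, c if x == b, and a if x == c. */
uint32_t next3(uint32_t x, uint32_t a, uint32_t b, uint32_t c) {
    uint32_t is_c = 0u - (uint32_t)(x == c);   /* all 1's iff x == c */
    uint32_t is_a = 0u - (uint32_t)(x == a);   /* all 1's iff x == a */
    return (is_c & (a ^ c)) ^ (is_a & (b ^ c)) ^ c;
}
```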


Pursuing this matter, there is an ingenious branch-free method of cycling among three values on machines that 
do not have comparison predicate instructions [GLS1]. It executes in eight instructions on most machines. 


Because a, b, and c are distinct, there are two bit positions, n1 and n2, where the bits of a, b, and c are not all the 
same, and where the "odd one out" (the one whose bit differs in that position from the other two) is different in 
positions n1 and n2. This is illustrated below for the values 21, 31, and 20, shown in binary. 


10101   c 
11111   a 
10100   b 


Here n1 may be taken as bit 1 (or bit 3), where a is the odd one out, and n2 is bit 0, where b is the odd one out. 


Without loss of generality, rename a, b, and c so that a has the odd one out in position n1 and b has the odd one 
out in position n2, as shown above. Then there are two possibilities for the values of the bits at position n1, 
namely (a_n1, b_n1, c_n1) = (0, 1, 1) or (1, 0, 0). Similarly, there are two possibilities for the bits at 
position n2, namely (a_n2, b_n2, c_n2) = (0, 1, 0) or (1, 0, 1). This makes four cases in all, and formulas 
for each of these cases are shown below: 


Case 1. (a_n1, b_n1, c_n1) = (0, 1, 1), (a_n2, b_n2, c_n2) = (0, 1, 0): 

f(x) = x_n1 * (a - b) + x_n2 * (c - a) + b 


Case 2. (a_n1, b_n1, c_n1) = (0, 1, 1), (a_n2, b_n2, c_n2) = (1, 0, 1): 

f(x) = x_n1 * (a - b) + x_n2 * (a - c) + (b + c - a) 


Case 3. (a_n1, b_n1, c_n1) = (1, 0, 0), (a_n2, b_n2, c_n2) = (0, 1, 0): 

f(x) = x_n1 * (b - a) + x_n2 * (c - a) + a 


Case 4. (a_n1, b_n1, c_n1) = (1, 0, 0), (a_n2, b_n2, c_n2) = (1, 0, 1): 

f(x) = x_n1 * (b - a) + x_n2 * (a - c) + c 


In these formulas, the left operand of each multiplication is a single bit. A multiplication by 0 or 1 may be 
converted into an and with a value of 0 or all 1's. Thus, the formulas can be rewritten as illustrated below for 
the first formula: 


f(x) = [((x << (31 - n1)) >>s 31) & (a - b)] + [((x << (31 - n2)) >>s 31) & (c - a)] + b 


Because all variables except x are constants, this can be evaluated in eight instructions on the basic RISC. Here 
again, the additions and subtractions can be replaced with exclusive or. 
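
As a worked instance, take the values from the illustration: a = 31, b = 20, c = 21, with n1 = bit 1 and n2 = bit 0. The bits of (a, b, c) are (1, 0, 0) at n1 and (1, 0, 1) at n2, which is Case 4. The C sketch below is an assumption-laden illustration: the function name is hypothetical, and each all-0's or all-1's mask is formed by negating the isolated bit instead of the shift-left, shift-right-signed pair, because C does not guarantee an arithmetic right shift.

```c
#include <stdint.h>

/* Cycles 31 -> 20 -> 21 -> 31 using the Case 4 formula
   f(x) = x_n1*(b - a) + x_n2*(a - c) + c, with n1 = 1 and n2 = 0. */
uint32_t cycle3(uint32_t x) {
    const uint32_t a = 31u, b = 20u, c = 21u;
    uint32_t m1 = 0u - ((x >> 1) & 1u);   /* all 1's iff bit n1 of x is 1 */
    uint32_t m2 = 0u - (x & 1u);          /* all 1's iff bit n2 of x is 1 */
    return (m1 & (b - a)) + (m2 & (a - c)) + c;   /* arithmetic is mod 2^32 */
}
```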


This idea can be extended to cycling among four or more constants. The essence of the idea is to find bit 
positions n1, n2, ..., at which the bits uniquely identify the constants. For four constants, three bit positions 
always suffice. Then (for four constants) solve the following equation for s, t, u, and v (that is, solve the system 
of four linear equations in which f(x) is a, b, c, or d, and the coefficients x_ni are 0 or 1): 


f(x) = x_n1 s + x_n2 t + x_n3 u + v 


If the four constants are uniquely identified by only two bit positions, the equation to solve is 


f(x) = x_n1 s + x_n2 t + x_n1 x_n2 u + v. 


Chapter 3. Power-of-2 Boundaries 


Rounding Up/Down to a Multiple of a Known Power of 2 


Rounding Up/Down to the Next Power of 2 


Detecting a Power-of-2 Boundary Crossing 


3-1 Rounding Up/Down to a Multiple of a Known Power of 2 


Rounding an unsigned integer x down to, for example, the next smaller multiple of 8, is trivial: x & -8 does it. 
An alternative is (x >>u 3) << 3. These work for signed integers as well, provided "round down" means to 
round in the negative direction (e.g., (-37) & (-8) = -40). 


Rounding up is almost as easy. For example, an unsigned integer x can be rounded up to the next greater 
multiple of 8 with either of 


(x + 7) & -8, or 
x + (—x & 7). 


These expressions are correct for signed integers as well, provided "round up" means to round in the positive 
direction. The second term of the second expression is useful if you want to know how much you must add to x 
to make it a multiple of 8 [Gold]. 
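
These expressions carry over to C almost unchanged for unsigned operands. In the sketch below the function names are hypothetical, and -8 is written as ~7u to keep everything in unsigned arithmetic.

```c
#include <stdint.h>

uint32_t round_down8(uint32_t x) { return x & ~7u; }          /* x & -8 */
uint32_t round_up8(uint32_t x)   { return (x + 7u) & ~7u; }   /* (x + 7) & -8 */

/* The amount that must be added to x to reach a multiple of 8: -x & 7. */
uint32_t pad_to8(uint32_t x)     { return (0u - x) & 7u; }
```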


To round a signed integer to the nearest multiple of 8 toward 0, you can combine the two expressions above in 
an obvious way: 


t ← (x >>s 31) & 7; 
(x + t) & -8 


An alternative for the first line is t ← (x >>s 2) >>u 29, which is useful if the machine lacks and immediate, 
or if the constant is too large for its immediate field. 


Sometimes the rounding factor is given as the log2 of the alignment amount (e.g., a value of 3 means to round 
to a multiple of 8). In this case, code such as the following may be used, where k = log2(alignment amount): 


round down:  x & ((-1) << k) 
             (x >>u k) << k 
round up:    t ← (1 << k) - 1; (x + t) & ¬t 
             t ← (-1) << k; (x - t - 1) & t 
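
A C sketch of the first round-down and first round-up forms follows. The function names are hypothetical, and the sketch assumes 0 ≤ k ≤ 31, since larger shift amounts are undefined in C.

```c
#include <stdint.h>

/* Round x down to a multiple of 2^k. */
uint32_t round_down_pow2(uint32_t x, unsigned k) {
    return x & (0xFFFFFFFFu << k);     /* x & ((-1) << k) */
}

/* Round x up to a multiple of 2^k (wraps to 0 on overflow). */
uint32_t round_up_pow2(uint32_t x, unsigned k) {
    uint32_t t = ((uint32_t)1 << k) - 1u;
    return (x + t) & ~t;
}
```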


3-2 Rounding Up/Down to the Next Power of 2 


We define two functions that are similar to floor and ceiling, but which are directed roundings to the closest 
integral power of 2, rather than to the closest integer. Mathematically, they are defined by 


          undefined,       x < 0,                    undefined,       x < 0, 
flp2(x) = 0,               x = 0,          clp2(x) = 0,               x = 0, 
          2^⌊log2 x⌋,      otherwise;                2^⌈log2 x⌉,      otherwise. 


The initial letters of the function names are intended to suggest "floor" and "ceiling." Thus, flp2(x) is the 
greatest power of 2 that is ≤ x, and clp2(x) is the least power of 2 that is ≥ x. These definitions make sense 
even when x is not an integer (e.g., flp2(0.1) = 0.0625). The functions satisfy several relations analogous to 
those involving floor and ceiling, such as those shown below, where n is an integer. 


⌊x⌋ = ⌈x⌉ iff x is an integer       flp2(x) = clp2(x) iff x is a power of 2 or is 0 
⌊x + n⌋ = ⌊x⌋ + n                   flp2(2^n x) = 2^n flp2(x) 
⌈x⌉ = -⌊-x⌋                         clp2(x) = 1/flp2(1/x), x ≠ 0 


Computationally, we deal only with the case in which x is an integer, and we take it to be unsigned, so the 
functions are well defined for all x. We require the value computed to be the arithmetically correct value 
modulo 2^32 (that is, we take clp2(x) to be 0 for x > 2^31). The functions are tabulated below for a few values of x. 


x            flp2(x)    clp2(x) 

0            0          0 
1            1          1 
2            2          2 
3            2          4 
4            4          4 
5            4          8 
...          ...        ... 
2^31 - 1     2^30       2^31 
2^31         2^31       2^31 
2^31 + 1     2^31       0 
...          ...        ... 
2^32 - 1     2^31       0 


Functions flp2 and clp2 are connected by the relations shown below. These can be used to compute one from 
the other, subject to the indicated restrictions. 


clp2(x) = 2 flp2(x - 1),        x ≠ 1, 
        = flp2(2x - 1),         1 ≤ x ≤ 2^31, 

flp2(x) = clp2(x ÷ 2 + 1),      x ≠ 0, 
        = clp2(x + 1) ÷ 2,      x < 2^31. 


The round-up and round-down functions can be computed quite easily with the number of leading zeros 
instruction, as shown below. However, for these relations to hold for x = 0 and x > 2^31, the computer must have 
its shift instructions defined to produce 0 for shift amounts of -1, 32, and 63. Many machines (e.g., PowerPC) 
have "mod 64" shifts, which do this. In the case of -1, it is adequate if the machine shifts in the opposite 
direction (that is, a shift left of -1 becomes a shift right of 1). 


flp2(x) = 1 << (31 - nlz(x)) 
        = 1 << (nlz(x) ⊕ 31) 
        = 0x80000000 >>u nlz(x) 

clp2(x) = 1 << (32 - nlz(x - 1)) 
        = 0x80000000 >>u (nlz(x - 1) - 1) 
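
In C, shifts by 32 or by a negative amount are undefined, so a portable sketch has to test for the boundary cases explicitly instead of relying on mod-64 shifts. The version below is illustrative only: the function names are hypothetical, and nlz is a simple loop standing in for the number of leading zeros instruction.

```c
#include <stdint.h>

/* Number of leading zeros; a plain loop used here as a stand-in. */
unsigned nlz(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Greatest power of 2 <= x, or 0 for x = 0. */
uint32_t flp2_nlz(uint32_t x) {
    return x ? 0x80000000u >> nlz(x) : 0u;
}

/* Least power of 2 >= x, modulo 2^32 (0 for x = 0 and for x > 2^31). */
uint32_t clp2_nlz(uint32_t x) {
    unsigned s = 32 - nlz(x - 1u);     /* ranges from 0 to 32 */
    return s < 32 ? (uint32_t)1 << s : 0u;
}
```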


Rounding Down 


Figure 3-1 illustrates a branch-free algorithm that might be useful if number of leading zeros is not available. 
This algorithm is based on right-propagating the leftmost 1-bit, and executes in 12 instructions. 


Figure 3-1 Greatest power of 2 less than or equal to x, branch-free. 


unsigned flp2(unsigned x) { 
   x = x | (x >> 1); 
   x = x | (x >> 2); 
   x = x | (x >> 4); 
   x = x | (x >> 8); 
   x = x | (x >>16); 
   return x - (x >> 1); 
} 


Figure 3-2 shows two simple loops that compute the same function. All variables are unsigned integers. The 
loop on the right keeps turning off the rightmost 1-bit of x until x = 0, and then returns the previous value of x. 


Figure 3-2 Greatest power of 2 less than or equal to x, simple loops. 


y = 0x80000000;              do {
while (y > x)                   y = x;
   y = y >> 1;                  x = x & (x - 1);
return y;                    } while(x != 0);
                             return y;


The loop on the left executes in 4nlz(x) + 3 instructions. The loop on the right, for $x \ne 0$, executes in 4pop(x)[1] instructions, if the comparison to 0 is zero-cost.


[1] pop(x) is the number of 1-bits in x.
Rounding Up 


The right-propagation trick yields a good algorithm for rounding up to the next power of 2. This algorithm, 
shown in Figure 3-3, is branch-free and runs in 12 instructions. 


Figure 3-3 Least power of 2 greater than or equal to x. 


unsigned clp2(unsigned x) {
   x = x - 1;
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return x + 1;
}


An attempt to compute this with the obvious loop does not work out very well:


   y = 1;
   while (y < x)       // Unsigned comparison.
      y = 2*y;
   return y;


This code returns 1 for x = 0, which is probably not what you want, loops forever for $x > 2^{31}$ (y wraps to 0 and stays there), and executes in 4n + 3 instructions, where n is the power of 2 of the returned integer. Thus, it is slower than the branch-free code, in terms of instructions executed, for $n \ge 3$ ($x \ge 5$).


3-3 Detecting a Power-of-2 Boundary Crossing 


Assume memory is divided into blocks that are a power of 2 in size, starting at address 0. The blocks may be 
words, doublewords, pages, and so on. Then, given a starting address a and a length l, we wish to determine whether or not the address range from a to a + l - 1, $l \ge 2$, crosses a block boundary. The quantities a and l are unsigned and any values that fit in a register are possible.


If l = 0 or 1, a boundary crossing does not occur, regardless of a. If l exceeds the block size, a boundary crossing does occur, regardless of a. For very large values of l (wraparound is possible), a boundary crossing can occur even if the first and last bytes of the address range are in the same block.


There is a surprisingly concise way to detect boundary crossings on the IBM System/370 [CJS]. This method is 
illustrated below for a block size of 4096 bytes (a common page size). 


O RA,=A(-4096) 
ALR RA,RL 
BO CROSSES 


The first instruction forms the logical or of RA (which contains the starting address a) and the number 
0xFFFFF000. The second instruction adds in the length, and sets the machine's 2-bit condition code. For the
add logical instruction, the first bit of the condition code is set to 1 if a carry occurred, and the second bit is set 
to 1 if the 32-bit register result is nonzero. The last instruction branches if both bits are set. At the branch target, 
RA will contain the length that extends beyond the first page (this is an extra feature that was not asked for). 


If, for example, a = 0 and l = 4096, a carry occurs but the register result is 0, so the program properly does not
branch to label CROSSES. 


Let us see how this method can be adapted to RISC machines, which generally do not have branch on carry 
and register result nonzero. Using a block size of 8 for notational simplicity, the method of [CJS] branches to CROSSES if a carry occurred ($(a \mid -8) + l \ge 2^{32}$) and the register result is nonzero ($(a \mid -8) + l \ne 2^{32}$). Thus, it is equivalent to the predicate


$(a \mid -8) + l > 2^{32}.$


This in turn is equivalent to getting a carry in the final addition in evaluating ((a | -8) - 1) + l. If the machine has
branch on carry, this can be used directly, giving a solution in about five instructions counting a load of the 
constant -8. 


If the machine does not have branch on carry, we can use the fact that carry occurs in x + y iff $\neg x \overset{u}{<} y$
(see "Unsigned Add/Subtract" on page 29) to obtain the expression


$\neg((a \mid -8) - 1) \overset{u}{<} l.$


Using various identities such as $\neg(x - 1) = -x$ gives the following equivalent expressions for the "boundary
crossed" predicate:


$-(a \mid -8) \overset{u}{<} l$
$\neg(a \mid -8) + 1 \overset{u}{<} l$
$(\neg a \,\&\, 7) + 1 \overset{u}{<} l$


These can be evaluated in five or six instructions on most RISC computers. 
Using another tack, clearly an 8-byte boundary is crossed if


$(a \,\&\, 7) + l - 1 \ge 8.$


This cannot be directly evaluated because of the possibility of overflow (which occurs if l is very large), but it is easily rearranged to 8 - (a & 7) < l, which can be directly evaluated on the computer (no part of it overflows). This gives the expression


$8 - (a \,\&\, 7) \overset{u}{<} l,$


which can be evaluated in five instructions on most RISCs (four if it has subtract from immediate). If a boundary crossing occurs, the length that extends beyond the first block is given by l - (8 - (a & 7)), which can be calculated with one additional instruction (subtract).
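The two predicate forms derived above translate directly into C with unsigned arithmetic. The sketch below generalizes the block size from 8 to any power of 2; the function names and the `block` parameter are assumptions made for illustration, not part of the original text.

```c
#include <stdint.h>

/* Nonzero if the byte range a .. a+l-1 crosses a block boundary.
   Uses the form  block - (a & (block-1)) <u l.
   Assumption: block is a power of 2. */
int crosses_boundary(uint32_t a, uint32_t l, uint32_t block) {
    return block - (a & (block - 1)) < l;
}

/* The same predicate in the form  -(a | -block) <u l.
   Negation is written as 0u - x to stay in unsigned arithmetic. */
int crosses_boundary_alt(uint32_t a, uint32_t l, uint32_t block) {
    return (0u - (a | (0u - block))) < l;
}

/* Length that extends beyond the first block; meaningful only
   when crosses_boundary() is nonzero. */
uint32_t bytes_beyond_first_block(uint32_t a, uint32_t l, uint32_t block) {
    return l - (block - (a & (block - 1)));
}
```

With a block size of 8, a range starting at a = 7 with l = 2 crosses a boundary and has one byte beyond the first block, whereas a = 0 with l = 8 exactly fills a block and does not cross.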


Chapter 4. Arithmetic Bounds 


Checking Bounds of Integers 


Propagating Bounds through Add's and Subtract's


Propagating Bounds through Logical Operations 


4-1 Checking Bounds of Integers 
By "bounds checking" we mean to verify that an integer x is within two bounds a and b—that is, that 


$a \le x \le b.$


We first assume that all quantities are signed integers. 


An important application is the checking of array indexes. For example, suppose a one-dimensional array A can 
be indexed by values from 1 to 10. Then, for a reference A(i), a compiler might generate code to check that 


$1 \le i \le 10$


and to branch or trap if this is not the case. In this section we show that this check can be done with a single
comparison, by performing the equivalent check [PL8]:


$i - 1 \overset{u}{\le} 9.$


This is probably better code, because it involves only one compare-branch (or compare-trap), and because the 
quantity i - 1 is probably needed anyway for the array addressing calculations. 


Does the implementation 


$a \le x \le b \iff x - a \overset{u}{\le} b - a$


always work, even if overflow may occur in the subtractions? It does, provided we somehow know that $a \le b$.
In the case of array bounds checking, language rules may require that an array not have a number of elements 
(or number of elements along any axis) that are 0 or negative, and this rule can be verified at compile time or, 
for dynamic extents, at array allocation time. In such an environment, the transformation above is correct, as we 
will now show. 


It is convenient to use a lemma, which is good to know in its own right. 


Lemma. If a and b are signed integers and $a \le b$, then the computed value b - a correctly represents the
arithmetic value b - a, if the computed value is interpreted as unsigned.


Proof. (Assume a 32-bit machine.) Because $a \le b$, the true difference b - a is in the range 0 to $(2^{31} - 1) - (-2^{31}) = 2^{32} - 1$. If the true difference is in the range 0 to $2^{31} - 1$, then the machine result is correct (because the result is representable under signed interpretation), and the sign bit is off. Hence the machine result is correct under either signed or unsigned interpretation.


If the true difference is in the range $2^{31}$ to $2^{32} - 1$, then the machine result will differ by some multiple of $2^{32}$ (because the result is not representable under signed interpretation). This brings the result (under signed interpretation) to the range $-2^{31}$ to -1. The machine result is too low by $2^{32}$, and the sign bit is on. Reinterpreting the result as unsigned increases it by $2^{32}$, because the sign bit is given a weight of $+2^{31}$ rather than $-2^{31}$. Hence the reinterpreted result is correct.


The "bounds theorem" is 

Theorem. If a and b are signed integers and $a \le b$, then

Equation 1

$a \le x \le b \iff x - a \overset{u}{\le} b - a.$


Proof. We distinguish three cases, based on the value of x. In all cases, by the lemma, since $a \le b$, the computed value b - a is equal to the arithmetic value b - a if b - a is interpreted as unsigned, as it is in Equation (1).


Case 1, x < a: In this case, x - a interpreted as unsigned is $x - a + 2^{32}$. Whatever the values of x and b are
(within the range of 32-bit numbers),


$x + 2^{32} > b.$


Therefore


$x - a + 2^{32} > b - a,$


and hence


$x - a \overset{u}{>} b - a.$


In this case, both sides of Equation (1) are false.


Case 2, $a \le x \le b$: Then, arithmetically, $x - a \le b - a$. Because $a \le x$, by the lemma the true difference x - a equals the computed value x - a if the latter is interpreted as unsigned. Hence


$x - a \overset{u}{\le} b - a;$


that is, both sides of Equation (1) are true.


Case 3, x > b: Then x - a > b - a. Because in this case x > a (because $b \ge a$), by the lemma the true difference x - a equals the computed value of x - a if the latter is interpreted as unsigned. Hence


$x - a \overset{u}{>} b - a;$


that is, both sides of Equation (1) are false.


The theorem stated above is also true if a and b are unsigned integers. This is because for unsigned integers the 
lemma holds trivially, and the above proof is also valid. 
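As an illustration of the theorem, the two-sided test can be written in C with one unsigned comparison. This is a sketch, not compiler output: the name `in_range` is hypothetical, and the subtractions are carried out on values converted to `uint32_t` because signed overflow is undefined in C, whereas unsigned arithmetic wraps modulo $2^{32}$ exactly as the lemma assumes.

```c
#include <stdint.h>

/* Returns nonzero iff a <= x <= b, using a single unsigned comparison.
   Precondition (as in the theorem): a <= b. */
int in_range(int32_t x, int32_t a, int32_t b) {
    return (uint32_t)x - (uint32_t)a <= (uint32_t)b - (uint32_t)a;
}
```

For the array example, `in_range(i, 1, 10)` computes the same thing as the check $i - 1 \overset{u}{\le} 9$.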


Below is a list of similar bounds-checking transformations, with the one of the theorem above stated again. 
These all hold for either signed or unsigned interpretation of a, b, and x. 


Equation 2 


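if $a \le b$ then $a \le x \le b \iff x - a \overset{u}{\le} b - a \iff b - x \overset{u}{\le} b - a$


if $a \le b$ then $a \le x < b \iff x - a \overset{u}{<} b - a$


if $a \le b$ then $a < x \le b \iff b - x \overset{u}{<} b - a$


if $a < b$ then $a < x < b \iff x - a - 1 \overset{u}{<} b - a - 1 \iff b - x - 1 \overset{u}{<} b - a - 1$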


In the last rule, b - a - 1 may be replaced with b + ~a. 


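There are some quite different transformations that may be useful when the test is of the form $-2^{n-1} \le x \le 2^{n-1} - 1$. This is a test to see if a signed quantity x can be correctly represented as an n-bit two's-complement integer.
To illustrate with n = 8, the following tests are equivalent:


a. $-128 \le x \le 127$

b. $x + 128 \overset{u}{\le} 255$

c. $(x \overset{s}{\gg} 7) + 1 \overset{u}{\le} 1$

d. $x \overset{s}{\gg} 7 = x \overset{s}{\gg} 31$

e. $(x \overset{s}{\gg} 7) + (x \overset{u}{\gg} 31) = 0$

f. $(x \ll 24) \overset{s}{\gg} 24 = x$

g. $x \oplus (x \overset{s}{\gg} 31) \le 127$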


Equation (b) is simply an application of the preceding material in this section. Equation (c) is as well, after 
shifting x right seven positions. Equations (c)-(f) and possibly (g) are probably useful only if the constants in 
Equations (a) and (b) exceed the size of the immediate fields of the computer's compare and add instructions. 
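Two of these tests can be sketched in portable C as follows. The function names are hypothetical. Because right-shifting a negative signed value is implementation-defined in C, test (g) builds the all-ones-or-zero sign mask from an unsigned shift rather than from a signed shift right of 31; this is a substitution made for portability, not the form a compiler would necessarily emit.

```c
#include <stdint.h>

/* Test (b): x + 128 <=u 255. */
int fits8_add(int32_t x) {
    return (uint32_t)x + 128u <= 255u;
}

/* Test (g): x XOR (x >>s 31) <= 127.
   The sign mask is 0xFFFFFFFF for negative x and 0 otherwise. */
int fits8_xor(int32_t x) {
    uint32_t ux = (uint32_t)x;
    uint32_t sign = 0u - (ux >> 31);
    /* ux ^ sign has its top bit clear, so an unsigned compare suffices. */
    return (ux ^ sign) <= 127u;
}
```

Both functions return nonzero exactly when x lies in the range -128 to 127.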


Another special case involving powers of 2 is


$0 \le x \le 2^n - 1 \iff (x \overset{u}{\gg} n) = 0,$


or, more generally,


$a \le x \le a + 2^n - 1 \iff ((x - a) \overset{u}{\gg} n) = 0.$


4-2 Propagating Bounds through Add's and Subtract's 


Some optimizing compilers perform "range analysis" of expressions. This is the process of determining, for 
each occurrence of an expression in a program, upper and lower bounds on its value. Although this 
optimization is not a really big winner, it does permit improvements such as omitting the range check on a C 
"switch" statement and omitting some subscript bounds checks that compilers may provide as a debugging aid. 


Suppose we have bounds on two variables x and y as follows, where all quantities are unsigned: 
Equation 3 


$a \le x \le b$, and


$c \le y \le d.$


Then, how can we compute tight bounds on x + y, x - y, and -x? Arithmetically, of course, $a + c \le x + y \le b + d$; but the point is that the additions may overflow.
The way to calculate the bounds is expressed in the following: 


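Theorem. If a, b, c, d, x, and y are unsigned integers and


$a \le x \le b$, and $c \le y \le d$,


then

Equation 4

$0 \le x + y \le 2^{32} - 1$ if $a + c \le 2^{32} - 1$ and $b + d \ge 2^{32}$,
$a + c \le x + y \le b + d$ otherwise;


Equation 5

$0 \le x - y \le 2^{32} - 1$ if $a - d < 0$ and $b - c \ge 0$,
$a - d \le x - y \le b - c$ otherwise;


Equation 6

$0 \le -x \le 2^{32} - 1$ if $a = 0$ and $b \ne 0$,
$-b \le -x \le -a$ otherwise.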


Inequalities (4) say that the bounds on x + y are "normally" a + c and b + d, but if the calculation of a + c does 
not overflow and the calculation of b + d does overflow, then the bounds are 0 and the maximum unsigned 
integer. Equations (5) are interpreted similarly, but the true result of a subtraction being less than 0 constitutes 
an overflow (in the negative direction). 

Proof. If neither a + c nor b + d overflows, then x + y, with x and y in the indicated ranges, cannot overflow, making the computed results equal to the true results, so the second inequality of (4) holds. If both a + c and b + d overflow, then so also does x + y. Now arithmetically, it is clear that


$a + c - 2^{32} \le x + y - 2^{32} \le b + d - 2^{32}.$


This, however, is what is calculated when the three terms overflow. Hence in this case also,


$a + c \overset{u}{\le} x + y \overset{u}{\le} b + d.$


If a + c does not overflow but b + d does, then


$a + c \le 2^{32} - 1$ and $b + d \ge 2^{32}.$


Because x + y takes on all values in the range a + c to b + d, it takes on the values $2^{32} - 1$ and $2^{32}$; that is, the computed value x + y takes on the values $2^{32} - 1$ and 0 (although it doesn't take on all values in that range).


Lastly, the case that a + c overflows but b + d does not cannot occur, because $a \le b$ and $c \le d$.


This completes the proof of inequalities (4). The proof of (5) is similar, but "overflow" means that a true 
difference is less than 0. 


Inequalities (6) can be proved by using (5) with a = b = 0, and then renaming the variables. (The expression -x
with x an unsigned number means to compute the value of $2^{32} - x$, or of $\neg x + 1$ if you prefer.)


Because unsigned overflow is so easy to recognize (see "Unsigned Add/Subtract" on page 29), these results are 


easily embodied in code, as shown in Figure 4-1 for addition and subtraction. The computed lower and upper
limits are variables s and t, respectively.


Figure 4-1 Propagating unsigned bounds through addition and subtraction operations. 


s = a + c;                     s = a - d;
t = b + d;                     t = b - c;
if (s >= a && t < b) {         if (s > a && t <= b) {
   s = 0;                         s = 0;
   t = 0xFFFFFFFF;}               t = 0xFFFFFFFF;}
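The two fragments of Figure 4-1 can be packaged as functions for experimentation. The following is a sketch only; the `urange` type and the function names are hypothetical and introduced here for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type holding a lower and an upper bound. */
typedef struct { uint32_t lo, hi; } urange;

/* Bounds on x + y (mod 2^32), given a <= x <= b and c <= y <= d. */
urange add_bounds(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    urange r;
    r.lo = a + c;
    r.hi = b + d;
    if (r.lo >= a && r.hi < b) {   /* a + c did not wrap, b + d did */
        r.lo = 0;
        r.hi = 0xFFFFFFFFu;
    }
    return r;
}

/* Bounds on x - y (mod 2^32), given a <= x <= b and c <= y <= d. */
urange sub_bounds(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    urange r;
    r.lo = a - d;
    r.hi = b - c;
    if (r.lo > a && r.hi <= b) {   /* a - d went below 0, b - c did not */
        r.lo = 0;
        r.hi = 0xFFFFFFFFu;
    }
    return r;
}
```

For instance, `add_bounds(1, 2, 3, 4)` yields the range 4 to 6, while `add_bounds(0xFFFFFFF0, 0xFFFFFFFF, 0, 0x20)` falls back to the full range because only the upper sum wraps.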


Signed Numbers 


The case of signed numbers is not so clean. As before, suppose we have bounds on two variables x and y as 
follows, where all quantities are signed: 


Equation 7 


$a \le x \le b$, and


$c \le y \le d.$


We wish to compute tight bounds on x + y, x - y, and -x. The reasoning is very similar to that for the case of 
unsigned numbers, and the results for addition are shown below. 


Equation 8 


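$a + c < -2^{31}$, $b + d < -2^{31}$:  $a + c \le x + y \le b + d$
$a + c < -2^{31}$, $b + d \ge -2^{31}$:  $-2^{31} \le x + y \le 2^{31} - 1$
$-2^{31} \le a + c < 2^{31}$, $b + d < 2^{31}$:  $a + c \le x + y \le b + d$
$-2^{31} \le a + c < 2^{31}$, $b + d \ge 2^{31}$:  $-2^{31} \le x + y \le 2^{31} - 1$
$a + c \ge 2^{31}$, $b + d \ge 2^{31}$:  $a + c \le x + y \le b + d$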


The first row means that if both of the additions a + c and b + d overflow in the negative direction, then the 
computed sum x + y lies between the computed sums a + c and b + d. This is because all three computed sums 
are too high by the same amount ($2^{32}$). The second row means that if the addition a + c overflows in the
negative direction, and the addition b + d either does not overflow or overflows in the positive direction, then 
the computed sum x + y can take on the extreme negative number and the extreme positive number (although 
perhaps not all values in between), which is not difficult to show. The other rows are interpreted similarly. 


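The rules for propagating bounds on signed numbers through the subtraction operation can easily be derived by
rewriting the bounds on y as


$-d \le -y \le -c$


and using the rules for addition. The results are shown below.


$a - d < -2^{31}$, $b - c < -2^{31}$:  $a - d \le x - y \le b - c$
$a - d < -2^{31}$, $b - c \ge -2^{31}$:  $-2^{31} \le x - y \le 2^{31} - 1$
$-2^{31} \le a - d < 2^{31}$, $b - c < 2^{31}$:  $a - d \le x - y \le b - c$
$-2^{31} \le a - d < 2^{31}$, $b - c \ge 2^{31}$:  $-2^{31} \le x - y \le 2^{31} - 1$
$a - d \ge 2^{31}$, $b - c \ge 2^{31}$:  $a - d \le x - y \le b - c$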


The rules for negation can be derived from the rules for subtraction by taking a = b = 0, omitting some 
impossible combinations, simplifying, and renaming. The results are as follows: 


$a = -2^{31}$, $b = -2^{31}$:  $-x = -2^{31}$
$a = -2^{31}$, $b \ne -2^{31}$:  $-2^{31} \le -x \le 2^{31} - 1$
$a \ne -2^{31}$:  $-b \le -x \le -a$


C code for the case of signed numbers is a bit messy. We will consider only addition. It seems to be simplest to check for the two cases in (8) in which the computed limits are the extreme negative and positive numbers. Overflow in the negative direction occurs if the two operands are negative and the sum is nonnegative (see "Signed Add/Subtract" on page 26). Thus, to check for the condition that $a + c < -2^{31}$, we could let s = a + c; and then code something like "if (a < 0 && c < 0 && s >= 0) ...." It will be more efficient,[2] however, to perform logical operations directly on the arithmetic variables, with the sign bit containing the true/false result of the logical operations. Then, we write the above condition as "if ((a & c & ~s) < 0) ...." These considerations lead to the program fragment shown in Figure 4-2 below.


[2] In the sense of more compact, less branchy, code; faster-running code may result from checking first for the case of
no overflow, assuming the limits are not likely to be large.


Figure 4-2 Propagating signed bounds through an addition operation. 


s = a + c;
t = b + d;
u = a & c & ~s & ~(b & d & ~t);
v = ((a ^ c) | ~(a ^ s)) & (~b & ~d & t);
if ((u | v) < 0) {
   s = 0x80000000;
   t = 0x7FFFFFFF;}


Here u is true (sign bit is 1) if the addition a + c overflows in the negative direction, and the addition b + d
does not overflow in the negative direction. Variable v is true if the addition a + c does not overflow and the
addition b + d overflows in the positive direction. The former condition can be expressed as "a and c have
different signs, or a and s have the same sign." The "if" test is equivalent to "if (u < 0 || v < 0)"; that
is, it succeeds if either u or v is true.
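Figure 4-2 relies on signed addition wrapping around, which standard C does not guarantee. As a cross-check, the rows of (8) can be applied literally by computing the true sums in 64-bit arithmetic. The sketch below does that; it is a reference formulation under that assumption rather than the sign-bit technique of the figure, and `srange`, `wrap32`, and `add_bounds_signed` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper type holding a signed lower and upper bound. */
typedef struct { int32_t lo, hi; } srange;

/* Reduce a true sum of two 32-bit values to its wrapped 32-bit value. */
static int32_t wrap32(int64_t v) {
    if (v < INT32_MIN) v += 4294967296LL;        /* add 2^32 */
    else if (v > INT32_MAX) v -= 4294967296LL;   /* subtract 2^32 */
    return (int32_t)v;
}

/* Bounds on the computed (wrapped) sum x + y, for a <= x <= b, c <= y <= d. */
srange add_bounds_signed(int32_t a, int32_t b, int32_t c, int32_t d) {
    int64_t s = (int64_t)a + c;   /* true lower sum */
    int64_t t = (int64_t)b + d;   /* true upper sum */
    int s_neg = s < INT32_MIN, s_pos = s > INT32_MAX;
    int t_neg = t < INT32_MIN, t_pos = t > INT32_MAX;
    srange r;
    if ((s_neg && !t_neg) || (!s_neg && !s_pos && t_pos)) {
        /* Rows 2 and 4 of (8): the computed sum spans the full range. */
        r.lo = INT32_MIN;
        r.hi = INT32_MAX;
    } else {
        /* Rows 1, 3, and 5: both limits are off by the same amount. */
        r.lo = wrap32(s);
        r.hi = wrap32(t);
    }
    return r;
}
```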


4-3 Propagating Bounds through Logical Operations 


As in the preceding section, suppose we have bounds on two variables x and y as follows, where all quantities 
are unsigned: 


Equation 9 


$a \le x \le b$, and


$c \le y \le d.$


Then what are some reasonably tight bounds on x | y, x & y, $x \oplus y$, and $\neg x$?


Combining inequalities (9) with some inequalities from Section 2-3 on page 16, and noting that $\neg x = 2^{32} - 1 - x$,
yields


$\max(a, c) \le (x \mid y) \le b + d,$
$0 \le (x \,\&\, y) \le \min(b, d),$
$0 \le (x \oplus y) \le b + d,$ and
$\neg b \le \neg x \le \neg a,$


where it is assumed that the addition b + d does not overflow. These are easy to compute and might be good 
enough for the compiler application mentioned in the preceding section. The bounds in the first two 
inequalities, however, are not tight. For example, writing constants in binary, suppose 


Equation 10 


$00010 \le x \le 00100$, and
$01001 \le y \le 10100.$


Then, by inspection (e.g., trying all 36 possibilities for x and y), we see that $01010 \le (x \mid y) \le 10111$. Thus, the lower bound is not max(a, c) nor is it a | c, and the upper bound is not b + d, nor is it b | d.


Given the values of a, b, c, and d in inequalities (9), how can one obtain tight bounds on the logical 
expressions? Consider first the minimum value attained by x | y. A reasonable guess might be the value of this 
expression with x and y both at their minima—that is, a | c. Example (10), however, shows that the minimum 
can be lower than this. 


To find the minimum, our procedure is to start with x = a and y = c, and then find an amount by which to 
increase either x or y so as to reduce the value of x | y. The result will be this reduced value. Rather than 
assigning a and c to x and y, however, we work directly with a and c, increasing one of them when doing so is 
valid and it reduces the value of a | c. 


The procedure is to scan the bits of a and c from left to right. If both bits are 0, the result will have a 0 in that 
position. If both bits are 1, the result will have a 1 in that position (clearly, no values of x and y could make the 
result less). In these cases, continue the scan to the next bit position. If one scanned bit is 1 and the other is 0, 
then it is possible that changing the 0 to 1 and setting all the following bits in that bound's value to 0 will reduce 
the value of a | c. This change will not increase the value of a | c, because the result has a 1 in that position 
anyway, from the other bound. Therefore, form the number with the 0 changed to 1 and subsequent bits 
changed to 0. If that is less than or equal to the corresponding upper limit, the change can be made; do it, and 
the result is the or of the modified value with the other lower bound. If the change cannot be made (because the 
altered value exceeds the corresponding upper bound), continue the scan to the next bit position. 


That's all there is to it. It might seem that after making the change, the scan should continue, looking for other 
opportunities to further reduce the value of a | c. However, even if a position is found that allows a 0 to be 
changed to 1, setting the subsequent bits to 0 does not reduce the value of a | c, because those bits are already 0. 


C code for this algorithm is shown in Figure 4-3. We assume that the compiler will move the subexpressions
~a & c and a & ~c out of the loop. More significantly, if the number of leading zeros instruction is available,
the program can be speeded up by initializing m with


   m = 0x80000000 >> nlz(a ^ c);


Figure 4-3 Minimum value of x | y with bounds on x and y. 


unsigned minOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
      }
      else if (a & ~c & m) {
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a | c;
}


This skips over initial bit positions in which a and c are both 0 or both 1. For this speedup to be effective when
a ^ c is 0 (that is, when a = c), the machine's shift right instruction should be mod 64. If number of leading zeros is not available, it may be worthwhile to use some version of the flp2 function (see page 46) with
argument a ^ c.


Now let us consider the maximum value attained by x | y, with the variables bounded as shown in inequalities 
(9). The algorithm is similar to that for the minimum, except it scans the values of bounds b and d (from left to 
right), looking for a position in which both bits are 1. If such a position is found, the algorithm tries to increase 
the value of b | d by decreasing one of the bounds by changing the 1 to 0, and setting all subsequent bits in that
bound to 1. If this is acceptable (if the resulting value is greater than or equal to the corresponding lower
bound), the change is made and the result is the value of b | d using the modified bound. If the change cannot be
done, it is attempted on the other bound. If the change cannot be done to either bound, the scan continues. C
code for this algorithm is shown in Figure 4-4. Here the subexpression b & d can be moved out of the loop, and the algorithm can be speeded up by initializing m with


   m = 0x80000000 >> nlz(b & d);


Figure 4-4 Maximum value of x | y with bounds on x and y. 


unsigned maxOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & d & m) {
         temp = (b - m) | (m - 1);
         if (temp >= a) {b = temp; break;}
         temp = (d - m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b | d;
}
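One way to gain confidence in the two scanning procedures is to compare them against an exhaustive search over small ranges, as was done by inspection for example (10). The sketch below restates minOR and maxOR with fixed-width types and adds a hypothetical checker, `or_bounds_agree`, which is an assumption of this illustration and only practical when the ranges are small.

```c
#include <stdint.h>

/* Minimum of x | y over a <= x <= b, c <= y <= d (after Figure 4-3). */
uint32_t minOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m, temp;
    for (m = 0x80000000u; m != 0; m >>= 1) {
        if (~a & c & m) {
            temp = (a | m) & (0u - m);     /* set bit m, clear lower bits */
            if (temp <= b) { a = temp; break; }
        } else if (a & ~c & m) {
            temp = (c | m) & (0u - m);
            if (temp <= d) { c = temp; break; }
        }
    }
    return a | c;
}

/* Maximum of x | y over a <= x <= b, c <= y <= d (after Figure 4-4). */
uint32_t maxOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m, temp;
    for (m = 0x80000000u; m != 0; m >>= 1) {
        if (b & d & m) {
            temp = (b - m) | (m - 1);      /* clear bit m, set lower bits */
            if (temp >= a) { b = temp; break; }
            temp = (d - m) | (m - 1);
            if (temp >= c) { d = temp; break; }
        }
    }
    return b | d;
}

/* Hypothetical checker: returns 1 if exhaustive search agrees with
   minOR and maxOR. Requires b and d to be well below 0xFFFFFFFF. */
int or_bounds_agree(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t lo = 0xFFFFFFFFu, hi = 0, x, y;
    for (x = a; x <= b; x++) {
        for (y = c; y <= d; y++) {
            uint32_t v = x | y;
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
    }
    return lo == minOR(a, b, c, d) && hi == maxOR(a, b, c, d);
}
```

With the bounds of example (10), that is a = 2, b = 4, c = 9, d = 20, minOR returns 10 (binary 01010) and maxOR returns 23 (binary 10111).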


There are two ways in which we might propagate the bounds of inequalities (9) through the expression x & y: 
algebraic and direct computation. The algebraic method uses DeMorgan's rule:


$x \,\&\, y = \neg(\neg x \mid \neg y)$


Because we know how to propagate bounds precisely through or, and it is trivial to propagate them through not
($a \le x \le b \iff \neg b \le \neg x \le \neg a$), we have


$\mathrm{minAND}(a, b, c, d) = \neg\,\mathrm{maxOR}(\neg b, \neg a, \neg d, \neg c)$, and
$\mathrm{maxAND}(a, b, c, d) = \neg\,\mathrm{minOR}(\neg b, \neg a, \neg d, \neg c).$


For the direct computation method, the code is very similar to that for propagating bounds through or. It is 
shown in Figures 4-5 and 4-6. 


Figure 4-5 Minimum value of x & y with bounds on x and y. 


unsigned minAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & ~c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a & c;
}


Figure 4-6 Maximum value of x & y with bounds on x and y. 


unsigned maxAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & ~d & m) {
         temp = (b & ~m) | (m - 1);
         if (temp >= a) {b = temp; break;}
      }
      else if (~b & d & m) {
         temp = (d & ~m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b & d;
}


The algebraic method of finding bounds on expressions in terms of the functions for and, or, and not works for 
all the binary logical expressions except exclusive or and equivalence. The reason these two present a difficulty 


is that when expressed in terms of and, or, and not, there are two terms containing x and y. For example, we are 
to find 


$$\min_{\substack{a \le x \le b \\ c \le y \le d}} (x \oplus y) = \min_{\substack{a \le x \le b \\ c \le y \le d}} ((x \,\&\, \neg y) \mid (\neg x \,\&\, y)).$$


The two operands of the or cannot be separately minimized (without proof that it works, which actually it does) 
because we seek one value of x and one value of y that minimizes the whole or expression. 


The following expressions may be used to propagate bounds through exclusive or: 


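$\mathrm{minXOR}(a, b, c, d) = \mathrm{minAND}(a, b, \neg d, \neg c) \mid \mathrm{minAND}(\neg b, \neg a, c, d),$
$\mathrm{maxXOR}(a, b, c, d) = \mathrm{maxOR}(0, \mathrm{maxAND}(a, b, \neg d, \neg c), 0, \mathrm{maxAND}(\neg b, \neg a, c, d)).$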


It is straightforward to evaluate the minXOR and maxXOR functions by direct computation. The code for 
minXOR is the same as that for minOR (Figure 4-3) except with the two break statements removed, and the 


return value changed to a ^ c. The code for maxXOR is the same as that for maxOR (Figure 4-4) except with
the four lines under the if clause replaced with


   temp = (b - m) | (m - 1);
   if (temp >= a) b = temp;
   else {
      temp = (d - m) | (m - 1);
      if (temp >= c) d = temp;
   }


and the return value changed to b ^ d.


Signed Bounds 


If the bounds are signed integers, propagating them through logical expressions is substantially more 
complicated. The calculation is irregular if 0 is within the range a to b, or c to d. One way to calculate the lower 
and upper bounds for the expression x | y is shown in Table 4-1. A "+" entry means that the bound at the top of 
the column is greater than or equal to 0, and a "-" entry means that it is less than 0. The column labelled 
"minOR (signed)" contains expressions for computing the lower bound of x | y and the last column contains 
expressions for computing the upper bound of x | y. One way to program this is to construct a value ranging 
from 0 to 15 from the sign bits of a, b, c, and d, and use a "switch" statement. Notice that not all values from 0 
to 15 are used, because it is impossible to have a > b or c > d. 


For signed numbers, the relation 


$a \le x \le b \iff \neg b \le \neg x \le \neg a$


holds, so the algebraic method can be used to extend the results of Table 4-1 to other logical expressions 
(except for exclusive or and equivalence). We leave this and similar extensions to others. 


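Table 4-1. Signed minOR and maxOR from Unsigned


a  b  c  d    minOR (signed)                 maxOR (signed)

-  -  -  -    minOR(a, b, c, d)              maxOR(a, b, c, d)
-  -  -  +    a                              -1
-  -  +  +    minOR(a, b, c, d)              maxOR(a, b, c, d)
-  +  -  -    c                              -1
-  +  -  +    min(a, c)                      maxOR(0, b, 0, d)
-  +  +  +    minOR(a, 0xFFFFFFFF, c, d)     maxOR(0, b, c, d)
+  +  -  -    minOR(a, b, c, d)              maxOR(a, b, c, d)
+  +  -  +    minOR(a, b, c, 0xFFFFFFFF)     maxOR(a, b, 0, d)
+  +  +  +    minOR(a, b, c, d)              maxOR(a, b, c, d)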


Chapter 5. Counting Bits 


Counting 1-Bits 
Parity 
Counting Leading 0's 


Counting Trailing 0's 


5-1 Counting 1-Bits 


The IBM Stretch computer (ca. 1960) had a means of counting the number of 1-bits in a word as well as the 
number of leading 0's. It produced these two quantities as a by-product of all logical operations! The former 
function is sometimes called population count (e.g., on Stretch and the SPARCv9).


For machines that don't have this instruction, a good way to count the number of 1-bits is to first set each 2-bit 
field equal to the sum of the two single bits that were originally in the field, and then sum adjacent 2-bit fields, 
putting the results in each 4-bit field, and so on. A more complete discussion of this trick is in [RND]. The 
method is illustrated in Figure 5-1, in which the first row shows a computer word whose 1-bits are to be 
summed, and the last row shows the result (23 decimal). 


Figure 5-1. Counting 1-bits, "divide and conquer" strategy. 


1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1

01 10 10 00 01 01 00 10 01 10 10 01 10 10 10 10

0011 0010 0010 0010 0011 0011 0100 0100

00000101 00000100 00000110 00001000

0000000000001001 0000000000001110

00000000000000000000000000010111


This is an example of the "divide and conquer" strategy, in which the original problem (summing 32 bits) is 
divided into two problems (summing 16 bits), which are solved separately, and the results are combined 
(added, in this case). The strategy is applied recursively, breaking the 16-bit fields into 8-bit fields, and so on. 


In the case at hand, the ultimate small problems (summing adjacent bits) can all be done in parallel, and 


combining adjacent sums can also be done in parallel in a fixed number of steps at each stage. The result is an 
algorithm that can be executed in $\log_2(32) = 5$ steps.


Other examples of divide and conquer are the well-known technique of binary search, a sorting method known 
as quicksort, and a method for reversing the bits of a word, discussed on page 101. 


The method illustrated in Figure 5-1 may be committed to C code as 


   x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
   x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
   x = (x & 0x0000FFFF) + ((x >>16) & 0x0000FFFF);


The first line uses (x >> 1) & 0x55555555 rather than the perhaps more natural (x & 0xAAAAAAAA) >> 1
because the code shown avoids generating two large constants in a register. This would cost an
instruction if the machine lacks the and not instruction. A similar remark applies to the other lines. 


Clearly, the last and is unnecessary, and other and's may be omitted when there is no danger that a field's sum 
will carry over into the adjacent field. Furthermore, there is a way to code the first line that uses one fewer 
instruction. This leads to the simplification shown in Figure 5-2, which executes in 21 instructions and is 
branch-free. 


Figure 5-2 Counting 1-bits in a word. 


int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000003F;
}
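The branch-free routine of Figure 5-2 is easy to check against a straightforward loop. In the sketch below, `pop_parallel` restates the figure with a fixed-width type, and `pop_naive` is a hypothetical reference routine that clears the rightmost 1-bit until none remain; both names are assumptions of this illustration.

```c
#include <stdint.h>

/* Divide-and-conquer population count, after Figure 5-2. */
int pop_parallel(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x0000003Fu);
}

/* Reference: turn off the rightmost 1-bit until x is 0. */
int pop_naive(uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}
```

Applied to the word of Figure 5-1 (0xBC637EFF in hexadecimal), both routines return 23.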


The first assignment to x is based on the first two terms of the rather surprising formula


Equation 1


$$\mathrm{pop}(x) = x - \left\lfloor \frac{x}{2} \right\rfloor - \left\lfloor \frac{x}{4} \right\rfloor - \cdots - \left\lfloor \frac{x}{2^{31}} \right\rfloor$$


In Equation (1), we must have $x \ge 0$. By treating x as an unsigned integer, Equation (1) can be implemented
with a sequence of 31 shift right immediate's of 1, and 31 subtract's. The procedure of Figure 5-2 uses the first
two terms of this on each 2-bit field, in parallel.


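There is a simple proof of Equation (1), which is shown below for the case of a 4-bit word. Let the word be
$b_3 b_2 b_1 b_0$, where each $b_i$ = 0 or 1. Then,


$$\begin{aligned}
x - \left\lfloor \frac{x}{2} \right\rfloor - \left\lfloor \frac{x}{4} \right\rfloor - \left\lfloor \frac{x}{8} \right\rfloor
&= b_3 \cdot 2^3 + b_2 \cdot 2^2 + b_1 \cdot 2^1 + b_0 \cdot 2^0\\
&\quad - (b_3 \cdot 2^2 + b_2 \cdot 2^1 + b_1 \cdot 2^0)\\
&\quad - (b_3 \cdot 2^1 + b_2 \cdot 2^0)\\
&\quad - (b_3 \cdot 2^0)\\
&= b_3(2^3 - 2^2 - 2^1 - 2^0) + b_2(2^2 - 2^1 - 2^0) + b_1(2^1 - 2^0) + b_0(2^0)\\
&= b_3 + b_2 + b_1 + b_0.
\end{aligned}$$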


Alternatively, Equation (1) can be derived by noting that bit i of the binary representation of a nonnegative
integer x is given by


$$b_i = \left\lfloor \frac{x}{2^i} \right\rfloor - 2\left\lfloor \frac{x}{2^{i+1}} \right\rfloor$$


and summing this for i = 0 to 31. Work it out; the last term is 0 because $x < 2^{32}$.


Equation (1) generalizes to other bases. For base ten it is 


$$\mathrm{sum\_digits}(x) = x - 9\left\lfloor \frac{x}{10} \right\rfloor - 9\left\lfloor \frac{x}{100} \right\rfloor - \cdots$$


where the terms are carried out until they are 0. This can be proved by essentially the same technique used above.
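Equation (1) and its base ten analogue can be exercised directly with a short loop that subtracts successive quotients until they reach 0. This is a sketch for checking the formulas, not a fast method; the function names are hypothetical.

```c
#include <stdint.h>

/* pop(x) = x - floor(x/2) - floor(x/4) - ... , per Equation (1). */
uint32_t pop_eq1(uint32_t x) {
    uint32_t sum = x;
    while (x != 0) {
        x >>= 1;        /* next term floor(x / 2^k) */
        sum -= x;
    }
    return sum;
}

/* sum_digits(x) = x - 9*floor(x/10) - 9*floor(x/100) - ... */
uint32_t sum_digits(uint32_t x) {
    uint32_t sum = x;
    while (x != 0) {
        x /= 10;        /* next term floor(x / 10^k) */
        sum -= 9 * x;
    }
    return sum;
}
```

For example, `sum_digits(123)` evaluates 123 - 9*12 - 9*1 = 6, and `pop_eq1(7)` evaluates 7 - 3 - 1 = 3.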


A variation of the above algorithm is to use a base four analogue of Equation (1) as a substitute for the second 
executable line of Figure 5-2: 


   x = x - 3*((x >> 2) & 0x33333333)


This code, however, uses the same number of instructions as the line it replaces (six), and requires a fast 
multiply-by-3 instruction. 


An algorithm in HAKMEM memo [HAK, item 169] counts the number of 1-bits in a word by using the first 
three terms of (1) to produce a word of 3-bit fields, each of which contains the number of 1-bits that were in it. 
It then adds adjacent 3-bit fields to form 6-bit field sums, and then adds the 6-bit fields by computing the value 
of the word modulo 63. Expressed in C, the algorithm is (the long constants are in octal) 


int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 033333333333;       // Count bits in
   x = x - n;                         // each 3-bit
   n = (n >> 1) & 033333333333;       // field.
   x = x - n;
   x = (x + (x >> 3)) & 030707070707; // 6-bit sums.
   x = modu(x, 63);                   // Add 6-bit sums.
   return x;
}


The last line uses the unsigned modulus function. (It could be either signed or unsigned if the word length were
a multiple of 3.) That the modulus function sums the 6-bit fields becomes clear by regarding the word x as an integer written in base 64. The remainder upon dividing a base b integer by b - 1 is, for $b \ge 3$, congruent mod b - 1
to the sum of the digits and, of course, is less than b - 1. Because the sum of the digits in this case must be less than
or equal to 32, modu(x, 63) must be equal to the sum of the digits of x, which is to say equal to the number of 1-bits in the original x.


This algorithm requires only ten instructions on the DEC PDP-10, because that machine has an instruction for 
computing the remainder with its second operand directly referencing a fullword in memory. On a basic RISC, 
it requires about 13 instructions, assuming the machine has unsigned modulus as one instruction (but not 
directly referencing a fullword immediate or memory operand). But it is probably not very fast, because 
division is almost always a slow operation. Also, it doesn't apply to 64-bit word lengths by simply extending 
the constants, although it does work for word lengths up to 62. 
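
For readers who want to try the HAKMEM method, here is a minimal sketch for a 32-bit word. It assumes that modu(x, 63) can be written with C's unsigned `%` operator; the name `pop_hakmem` and the `uint32_t` type are choices made for this illustration.

```c
#include <stdint.h>

/* Sketch of the HAKMEM item 169 method for a 32-bit word.
   Assumption: modu(x, 63) is expressed with the unsigned % operator. */
int pop_hakmem(uint32_t x) {
    uint32_t n;

    n = (x >> 1) & 033333333333u;         /* Count bits in   */
    x = x - n;                            /* each 3-bit      */
    n = (n >> 1) & 033333333333u;         /* field.          */
    x = x - n;
    x = (x + (x >> 3)) & 030707070707u;   /* 6-bit sums.     */
    return (int)(x % 63u);                /* Add 6-bit sums. */
}
```

The constants are octal, as in the listing above; the top 3-bit field of a 32-bit word holds only two bits, which the masks account for.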


A variation on the HAKMEM algorithm is to use Equation (1) to count the number of 1's in each 4-bit field,
working on all eight 4-bit fields in parallel [Hay1]. Then, the 4-bit sums may be converted to 8-bit sums in a
straightforward way, and the four bytes can be added with a multiplication by 0x01010101. This gives

int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 0x77777777;        // Count bits in
   x = x - n;                        // each 4-bit
   n = (n >> 1) & 0x77777777;        // field.
   x = x - n;
   n = (n >> 1) & 0x77777777;
   x = x - n;
   x = (x + (x >> 4)) & 0x0F0F0F0F;  // Get byte sums.
   x = x*0x01010101;                 // Add the bytes.
   return x >> 24;
}


This is 19 instructions on the basic RISC. It works well if the machine is two-address, because the first six lines
can be done with only one move register instruction. Also, the repeated use of the mask 0x77777777
permits loading it into a register and referencing it with register-to-register instructions. Furthermore, most of
the shifts are of only one position.


A quite different bit-counting method, illustrated in Figure 5-3, is to turn off the rightmost 1-bit repeatedly 
[Weg, RND], until the result is 0. It is very fast if the number of 1-bits is small, taking 2 + 5pop(x) instructions. 


Figure 5-3 Counting 1-bits in a sparsely populated word. 


int pop(unsigned x) {
   int n;

   n = 0;
   while (x != 0) {
      n = n + 1;
      x = x & (x - 1);
   }
   return n;
}


This has a dual algorithm that is applicable if the number of 1-bits is expected to be large. The dual algorithm
keeps turning on the rightmost 0-bit with x = x | (x + 1), until the result is all 1's (-1). Then, it returns 32 - n.
(Alternatively, the original number x can be complemented, or n can be initialized to 32 and counted down.)


A rather amazing algorithm is to rotate x left one position, 31 times, adding the 32 terms [MM]. The sum is the 
negative of pop(x)! That is, 


Equation 2 


$$ \text{pop}(x) = -\sum_{i=0}^{31} (x \mathbin{\text{rotl}} i), $$


where the additions are done modulo the word size, and the final sum is interpreted as a two's-complement 
integer. This is just a novelty; it would not be useful on most machines because the loop is executed 31 times 
and thus it requires 63 instructions plus the loop-control overhead. 


To see why Equation (2) works, consider what happens to a single 1-bit of x. It gets rotated to all positions, and
when these 32 numbers are added, a word of all 1-bits results. But this is -1. To illustrate, consider a 6-bit word
size and x = 001001 (binary):

   001001   x
   010010   x rotl 1
   100100   x rotl 2
   001001   x rotl 3
   010010   x rotl 4
   100100   x rotl 5

Each group of three rows sums to 111111, so the six rows sum to 111110 modulo 2^6, which is -2, the negative of
pop(x).


Of course, rotate-right would work just as well. 
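
The rotate-and-sum identity is easy to try out. The sketch below assumes a 32-bit word and writes the rotation with two shifts, since C has no rotate operator; the names `rotl1` and `pop_rotate` are hypothetical.

```c
#include <stdint.h>

/* Rotate left by one position; a helper assumed for this sketch. */
static uint32_t rotl1(uint32_t x) {
    return (x << 1) | (x >> 31);
}

/* Sum x and its 31 left rotations modulo 2^32; the negated sum is pop(x). */
int pop_rotate(uint32_t x) {
    uint32_t sum = x;
    int i;
    for (i = 1; i <= 31; i++) {
        x = rotl1(x);
        sum = sum + x;       /* unsigned arithmetic wraps modulo 2^32 */
    }
    return (int)(0u - sum);  /* two's-complement negation of the sum */
}
```

The negation is done in unsigned arithmetic so that the wraparound is well defined in C.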


The method of Equation (1) is very similar to this "rotate and sum" method, which becomes clear by rewriting 
(1) as 


$$ \text{pop}(x) = x - \sum_{i=1}^{31} (x \stackrel{u}{\gg} i). $$


This gives a slightly better algorithm than Equation (2) provides. It is better because it uses shift right, which is 
more commonly available than rotate, and because the loop can be terminated when the shifted quantity 
becomes 0. This reduces the loop-control code and may save a few iterations. The two algorithms are 
contrasted in Figure 5-4. 


Figure 5-4 Two similar bit-counting algorithms. 


int pop(unsigned x) {
   int i, sum;

   // Rotate and sum method        // Shift right & subtract
   sum = x;                        // sum = x;
   for (i = 1; i <= 31; i++) {     // while (x != 0) {
      x = rotatel(x, 1);           //    x = x >> 1;
      sum = sum + x;               //    sum = sum - x;
   }                               // }
   return -sum;                    // return sum;
}


A less interesting algorithm that may be competitive with all the algorithms for pop(x) in this section is to have 
a table that contains pop(x) for, say, x in the range 0 to 255. The table can be accessed four times, adding the 
four numbers obtained. A branch-free version of the algorithm looks like this: 


int pop(unsigned x) {          // Table lookup.
   static char table[256] = {
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
      ...
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};

   return table[x & 0xFF] +
          table[(x >> 8) & 0xFF] +
          table[(x >> 16) & 0xFF] +
          table[(x >> 24)];
}
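
Because the 256-entry table is elided in the listing, one way to experiment with the method is to fill the table at run time rather than type it in. The sketch below does that with the recurrence pop(i) = pop(i >> 1) + (i & 1). The names `pop_table`, `pop_table_init`, and `pop_lookup` are assumptions for this illustration, not part of the original code.

```c
#include <stdint.h>

static unsigned char pop_table[256];   /* pop_table[i] = number of 1-bits in i */

/* Fill the table once; entry i reuses the entry for i with its low bit shifted off. */
void pop_table_init(void) {
    int i;
    pop_table[0] = 0;
    for (i = 1; i < 256; i++)
        pop_table[i] = (unsigned char)(pop_table[i >> 1] + (i & 1));
}

/* Table lookup on each of the four bytes. Call pop_table_init() first. */
int pop_lookup(uint32_t x) {
    return pop_table[x & 0xFF] +
           pop_table[(x >> 8) & 0xFF] +
           pop_table[(x >> 16) & 0xFF] +
           pop_table[x >> 24];
}
```

A statically initialized table, as in the listing above, avoids the initialization call at the cost of a longer source file.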


Item 167 in [HAK] contains a short algorithm for counting the number of 1-bits in a 9-bit quantity that is right- 
adjusted and isolated in a register. It works only on machines with registers of 36 or more bits. Below is a 
version of that algorithm that works on 32-bit machines, but only for 8-bit quantities. 


   x = x * 0x08040201;    // Make 4 copies.
   x = x >> 3;            // So next step hits proper bits.
   x = x & 0x11111111;    // Every 4th bit.
   x = x * 0x11111111;    // Sum the digits (each 0 or 1).
   x = x >> 28;           // Position the result.


A version for 7-bit quantities is 


   x = x * 0x02040810;    // Make 4 copies, left-adjusted.
   x = x & 0x11111111;    // Every 4th bit.
   x = x * 0x11111111;    // Sum the digits (each 0 or 1).
   x = x >> 28;           // Position the result.


In these, the last two steps may be replaced with steps to compute the remainder of x modulo 15.


These are not particularly good; most programmers would probably prefer to use table lookup. The latter
algorithm above, however, has a version that uses 64-bit arithmetic, which might be useful for a 64-bit machine
that has fast multiplication. Its argument is a 15-bit quantity. (I don't believe there is a similar algorithm that
deals with 16-bit quantities, unless it is known that not all 16 bits are 1.) The data type long long is a GNU
C extension [Stall], meaning twice as long as an int, in our case 64 bits. The suffix ULL makes unsigned
long long constants.

int pop(unsigned x) {
   unsigned long long y;
   y = x * 0x0002000400080010ULL;
   y = y & 0x1111111111111111ULL;
   y = y * 0x1111111111111111ULL;
   y = y >> 60;
   return y;
}


Counting 1-Bits in an Array 


The simplest way to count the number of 1-bits in an array of fullwords, in the absence of the population count 
instruction, is to use a procedure such as that of Figure 5-2 on page 66 on each word of the array, and simply 
add the results. 


Another way, which may be faster, is to use the first two executable lines of that procedure on groups of three 
words of the array, adding the three partial results. Because each partial result has a maximum value of 4 in 
each 4-bit field, adding three of these partial results gives a word with at most 12 in each 4-bit field, so no sum 
overflows into the field to its left. Next, each of these partial results may be converted into a word having four 
8-bit fields with a maximum value of 24 in each field, using 


   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);

As these words are produced, they may be added until the maximum value is just less than 255; this would
allow summing ten such words ($\lfloor 255/24 \rfloor = 10$). When ten such words have been added, the result may be
converted into a word having two 16-bit fields, with a maximum value of 240 in each, with

   x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);

Lastly, 273 such words ($\lfloor 65535/240 \rfloor = 273$) can be added together until it is necessary to convert the sum to a
word consisting of just one 32-bit field, with

   x = (x & 0x0000FFFF) + (x >> 16);


In practice, the instructions added for loop control significantly detract from what is saved, so it is probably
overkill to follow this procedure to the extent described. The code of Figure 5-5 applies the idea with only one
intermediate level. First, it produces words containing four 8-bit partial sums. Then, after these are added
together as much as possible, a fullword sum is produced. The number of words of 8-bit fields that can be
added with no danger of overflow is $\lfloor 255/8 \rfloor = 31$.

Figure 5-5 Counting 1-bits in an array.

int pop_array(unsigned A[], int n) {

   int i, j, lim;
   unsigned s, s8, x;

   s = 0;
   for (i = 0; i < n; i = i + 31) {
      lim = min(n, i + 31);
      s8 = 0;
      for (j = i; j < lim; j++) {
         x = A[j];
         x = x - ((x >> 1) & 0x55555555);
         x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
         x = (x + (x >> 4)) & 0x0F0F0F0F;
         s8 = s8 + x;
      }
      x = (s8 & 0x00FF00FF) + ((s8 >> 8) & 0x00FF00FF);
      x = (x & 0x0000FFFF) + (x >> 16);
      s = s + x;
   }
   return s;
}


This algorithm was compared to the simple loop method by compiling the two procedures with GCC to a target 
machine that is very similar to the basic RISC. The result is 22 instructions per word for the simple method, and 
17.6 instructions per word for the method of Figure 5-5, a savings of 20%. 


Applications 


An application of the population count function is in computing the "Hamming distance" between two bit 
vectors, a concept from the theory of error-correcting codes. The Hamming distance is simply the number of 
places where the vectors differ; that is, 


$$ \text{dist}(x, y) = \text{pop}(x \oplus y). $$


See, for example, the chapter on error-correcting codes in [Dewd]. 


Another application is to allow reasonably fast direct-indexed access to a moderately sparse array A that is 
represented in a certain compact way. In the compact representation, only the defined, or nonzero, elements of 
the array are stored. There is an auxiliary bit string array bits of 32-bit words, which has a 1-bit for each index i 
for which A[i] is defined. As a speedup device, there is also an array of words bitsum such that bitsum[j] is the 
total number of 1-bits in all the words of bits that precede entry j. This is illustrated below for an array in which 
elements 0, 2, 32, 47, 48, and 95 are defined. 


   bits         bitsum   data
   0x00000005   0        A[0]
   0x00018001   2        A[2]
   0x80000000   5        A[32]
                         A[47]
                         A[48]
                         A[95]


Given an index i, 0 ≤ i ≤ 95, the corresponding index sparse_i into the data array is given by the number of
1-bits in array bits that precede the bit corresponding to i. This may be calculated as follows:

   j = i >> 5;              // j = i/32.
   k = i & 31;              // k = rem(i, 32);
   mask = 1 << k;           // A "1" at position k.
   if ((bits[j] & mask) == 0) goto no_such_element;
   mask = mask - 1;         // 1's to right of k.
   sparse_i = bitsum[j] + pop(bits[j] & mask);

The cost of this representation is two bits per element of the full array.
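
The lookup can be exercised on the six-element example. The sketch below hard-codes the bits and bitsum arrays for that example and bundles a population count in the style of Figure 5-2 so that it is self-contained. The function name `sparse_index` and the convention of returning -1 for an undefined element are assumptions made for illustration.

```c
#include <stdint.h>

/* Population count in the style of Figure 5-2 (included for self-containment). */
static int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
}

/* The example from the text: elements 0, 2, 32, 47, 48, and 95 are defined. */
static const uint32_t bits[3]   = {0x00000005u, 0x00018001u, 0x80000000u};
static const int      bitsum[3] = {0, 2, 5};

/* Returns the index into the compact data array, or -1 if A[i] is undefined.
   The caller must ensure 0 <= i <= 95. */
int sparse_index(int i) {
    int j = i >> 5;                     /* j = i/32            */
    int k = i & 31;                     /* k = rem(i, 32)      */
    uint32_t mask = (uint32_t)1 << k;   /* a "1" at position k */
    if ((bits[j] & mask) == 0) return -1;
    mask = mask - 1;                    /* 1's to right of k   */
    return bitsum[j] + pop(bits[j] & mask);
}
```

With this data, `sparse_index(47)` yields 3, the position of A[47] in the data column above.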


Still another application of the population function is in computing the number of trailing 0's in a word (see
"Counting Trailing 0's" on page 84).


5-2 Parity 


The "parity" of a string refers to whether it contains an odd or an even number of 1-bits. The string has "odd 
parity" if it contains an odd number of 1-bits; otherwise, it has "even parity." 


Computing the Parity of a Word 


Here we mean to produce a 1 if a word x has odd parity, and a 0 if it has even parity. This is the sum, modulo 2, 
of the bits of x—that is, the exclusive or of all the bits of x. 


One way to compute this is to compute pop(x); the parity is the rightmost bit of the result. This is fine if you 
have the population count instruction, but if not, there are better ways than using the code for pop(x). 


A rather direct method is to compute

$$ y = \bigoplus_{i=0}^{n-1} (x \gg i), $$

where n is the word size, and then the parity of x is given by the rightmost bit of y. (Here ⊕ denotes exclusive
or, but for this formula, ordinary addition could be used.)


The parity may be computed much more quickly, for moderately large n, as follows (illustrated for n = 32; the 
shifts can be signed or unsigned): 


Equation 3 

   y = x ^ (x >> 1);
   y = y ^ (y >> 2);
   y = y ^ (y >> 4);
   y = y ^ (y >> 8);
   y = y ^ (y >> 16);


This executes in ten instructions, as compared to 62 for the first method, even if the implied loop is completely
unrolled. Again, the parity bit is the rightmost bit of y. In fact, with either of these, if the shifts are unsigned,
then bit i of y gives the parity of the bits of x at and to the left of i. Furthermore, because exclusive or is its own
inverse, $y_i \oplus y_j$ is the parity of bits i - 1 through j, for i ≥ j.
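
The five-step exclusive-or scan can be sketched in C as follows, assuming a 32-bit word and unsigned shifts. The names `parity_scan` and `parity` are illustrative and do not come from the text.

```c
#include <stdint.h>

/* Exclusive-or scan of Equation (3) with unsigned shifts:
   bit i of the result is the parity of bits i..31 of x. */
uint32_t parity_scan(uint32_t x) {
    uint32_t y;
    y = x ^ (x >> 1);
    y = y ^ (y >> 2);
    y = y ^ (y >> 4);
    y = y ^ (y >> 8);
    y = y ^ (y >> 16);
    return y;
}

/* Parity of the whole word: 1 if x has an odd number of 1-bits, else 0. */
int parity(uint32_t x) {
    return (int)(parity_scan(x) & 1u);
}
```

For instance, a word with only its leftmost bit set scans to all 1's, because every bit position has exactly one 1-bit at or to its left.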


This is an example of the "parallel prefix," or "scan" operation, which has applications in parallel computing
[KRS; HS]. Given a sufficient number of processors, it can convert certain seemingly serial processes from
O(n) to $O(\log_2 n)$ time. For example, if you have an array of words and you wish to compute the exclusive or
scan operation on the entire array of bits, you can first use (3) on each word of the array, and then use
essentially the same technique on the array, doing exclusive or's on the words of the array. This takes more
elementary (word length) exclusive or operations than a simple left-to-right process, and hence it is not a good
idea for a uniprocessor. But on a parallel computer with a sufficient number of processors, it can do the job in
$O(\log_2 n)$ rather than O(n) time (where n is the number of words in the array).


A direct application of (3) is the conversion of an integer to Gray code (see page 236). 


If the code (3) is changed to use left shifts, the parity of the whole word x winds up in the leftmost bit position, 
and bit i of y gives the parity of the bits of x at and to the right of position i. 


If rotate shift's are used, the result is a word of all 1's if the parity of x is odd, and of all 0's if even. 


The following method executes in nine instructions and computes the parity of x as the integer 0 or 1 (the shifts 
are unsigned). 


   x = x ^ (x >> 1);
   x = (x ^ (x >> 2)) & 0x11111111;
   x = x*0x11111111;
   p = (x >> 28) & 1;


After the second statement above, each hex digit of x is 0 or 1, according to the parity of the bits in that hex 
digit. The multiply adds these digits, putting the sum in the high-order hex digit. There can be no carry out of 
any hex column during the add part of the multiply, because the maximum sum of a column is 8. 


The multiply and shift could be replaced by an instruction to compute the remainder after dividing x by 15, 
giving a (slow) solution in eight instructions, if the machine has remainder immediate. 


Adding a Parity Bit to a 7-Bit Quantity 

Item 167 in [HAK] contains a novel expression for putting even parity on a 7-bit quantity that is right-adjusted
and isolated in a register. By this we mean to set the bit to the left of the seven bits, to make an 8-bit quantity
with even parity. Their code is for a 36-bit machine, but it works on a 32-bit machine as well.

   modu((x * 0x10204081) & 0x888888FF, 1920)

Here, modu(a, b) denotes the remainder of a upon division by b, with the arguments and result interpreted as
unsigned integers, "*" denotes multiplication modulo $2^{32}$, and the constant 1920 is $15 \cdot 2^7$. Actually, this computes
the sum of the bits of x, and places the sum just to the left of the seven bits comprising x. For example, the
expression maps 0x0000007F to 0x000003FF, and 0x00000055 to 0x00000255.
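
The two examples can be reproduced with a few lines of C. This is a sketch under the assumption that `uint32_t` multiplication supplies the "multiplication modulo 2^32" and that the unsigned `%` operator plays the role of modu; the name `sum_bits_left` is hypothetical.

```c
#include <stdint.h>

/* Sketch of the HAKMEM expression; x must be a 7-bit quantity.
   uint32_t multiplication wraps modulo 2^32, and % stands in for modu. */
uint32_t sum_bits_left(uint32_t x) {
    return ((x * 0x10204081u) & 0x888888FFu) % 1920u;
}
```

The low-order bit of the sum lands in bit 7, which is why the low eight bits of the result carry even parity.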


Another ingenious formula from [HAK] is the following, which puts odd parity on a 7-bit integer:

   modu((x * 0x00204081) | 0x3DB6DB00, 1152),

where 1152 = $9 \cdot 2^7$. To understand this, it helps to know that the powers of 8 are ±1 modulo 9. If the
0x3DB6DB00 is changed to 0xBDB6DB00, this formula applies even parity.


These methods are not practical on today's machines, because memory is cheap but division is still slow. Most 
programmers would compute these functions with a simple table lookup. 


Application 


The parity operation is useful in multiplying bit matrices in GF(2) (in which the add operation is exclusive or). 


5-3 Counting Leading 0's 


There are several simple ways to count leading 0's with a binary search technique. Below is a model that has
several variations. It executes in 20 to 29 instructions on the basic RISC. The comparisons are
"logical" (unsigned integers).

   if (x == 0) return(32);
   n = 0;
   if (x <= 0x0000FFFF) {n = n +16; x = x <<16;}
   if (x <= 0x00FFFFFF) {n = n + 8; x = x << 8;}
   if (x <= 0x0FFFFFFF) {n = n + 4; x = x << 4;}
   if (x <= 0x3FFFFFFF) {n = n + 2; x = x << 2;}
   if (x <= 0x7FFFFFFF) {n = n + 1;}
   return n;

One variation is to replace the comparisons with and's:

   if ((x & 0xFFFF0000) == 0) {n = n +16; x = x <<16;}
   if ((x & 0xFF000000) == 0) {n = n + 8; x = x << 8;}
   ...


Another variation, which avoids large immediate values, is to use shift right instructions. 


The last if statement is simply adding 1 to n if the high-order bit of x is 0, so an alternative, which saves a
branch instruction, is:

   n = n + 1 - (x >> 31);

The "+ 1" in this assignment can be omitted if n is initialized to 1 rather than to 0. These observations lead to
the algorithm (12 to 20 instructions on the basic RISC) shown in Figure 5-6. A further improvement is possible
for the case in which x begins with a 1-bit: change the first line to

   if ((int)x <= 0) return (~x >> 26) & 32;

Figure 5-6 Number of leading zeros, binary search.

int nlz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x >> 16) == 0) {n = n +16; x = x <<16;}
   if ((x >> 24) == 0) {n = n + 8; x = x << 8;}
   if ((x >> 28) == 0) {n = n + 4; x = x << 4;}
   if ((x >> 30) == 0) {n = n + 2; x = x << 2;}
   n = n - (x >> 31);
   return n;
}


Figure 5-7 illustrates a sort of reversal of the above. It requires fewer operations the more leading 0's there are,
and avoids large immediate values and large shift amounts. It executes in 12 to 20 instructions on the basic
RISC.

Figure 5-7 Number of leading zeros, binary search, counting down.

int nlz(unsigned x) {
   unsigned y;
   int n;

   n = 32;
   y = x >>16;  if (y != 0) {n = n -16;  x = y;}
   y = x >> 8;  if (y != 0) {n = n - 8;  x = y;}
   y = x >> 4;  if (y != 0) {n = n - 4;  x = y;}
   y = x >> 2;  if (y != 0) {n = n - 2;  x = y;}
   y = x >> 1;  if (y != 0) return n - 2;
   return n - x;
}


This algorithm is amenable to a "table assist": the last four executable lines can be replaced by 


   static char table[256] = {0,1,2,2,3,3,3,3,4,4,...,8};
return n - table[x]; 


Many algorithms can be aided by table lookup, but this will not often be mentioned here. 

For compactness, this and the preceding algorithms in this section can be coded as loops. For example, the 
algorithm of Figure 5-7 becomes the algorithm shown in Figure 5-8. This executes in 23 to 33 basic RISC 
instructions, ten of which are conditional branches. 


Figure 5-8 Number of leading zeros, binary search, coded as a loop. 


int nlz(unsigned x) {
   unsigned y;
   int n, c;

   n = 32;
   c = 16;
   do {
      y = x >> c;  if (y != 0) {n = n - c;  x = y;}
      c = c >> 1;
   } while (c != 0);
   return n - x;
}


One can, of course, simply shift left one place at a time, counting, until the sign bit is on; or shift right one 
place at a time until the word is all 0. These algorithms are compact and work well if the number of leading 0's 
is expected to be small or large, respectively. One can combine the methods, as shown in Figure 5-9. We 
mention this because the technique of merging two algorithms and choosing the result of whichever one stops 
first is more generally applicable. It leads to code that runs fast on superscalar and VLIW machines, because of 
the proximity of independent instructions. (These machines can execute two or more instructions 
simultaneously, provided they are independent.) 


Figure 5-9 Number of leading zeros, working both ends at the same time. 


int nlz(int x) {
   int y, n;

   n = 0;
   y = x;
L: if (x < 0) return n;
   if (y == 0) return 32 - n;
   n = n + 1;
   x = x << 1;
   y = y >> 1;
   goto L;
}


On the basic RISC, this executes in min(3 + 6nlz(x), 5 + 6(32 - nlz(x))) instructions, or 99 worst case. 
However, one can imagine a superscalar or VLIW machine executing the entire loop body in one cycle if the 
comparison results are obtained as a by-product of the shifts, or in two cycles otherwise, plus the branch 
overhead. 


It is straightforward to convert either of the algorithms of Figure 5-6 or Figure 5-7 to a branch-free counterpart. 
Figure 5-10 shows a version that does the job in 28 basic RISC instructions. 


Figure 5-10 Number of leading zeros, branch-free binary search.

int nlz(unsigned x) {
   int y, m, n;

   y = -(x >> 16);      // If left half of x is 0,
   m = (y >> 16) & 16;  // set n = 16.  If left half
   n = 16 - m;          // is nonzero, set n = 0 and
   x = x >> m;          // shift x right 16.
                        // Now x is of the form 0000xxxx.
   y = x - 0x100;       // If positions 8-15 are 0,
   m = (y >> 16) & 8;   // add 8 to n and shift x left 8.
   n = n + m;
   x = x << m;

   y = x - 0x1000;      // If positions 12-15 are 0,
   m = (y >> 16) & 4;   // add 4 to n and shift x left 4.
   n = n + m;
   x = x << m;

   y = x - 0x4000;      // If positions 14-15 are 0,
   m = (y >> 16) & 2;   // add 2 to n and shift x left 2.
   n = n + m;
   x = x << m;

   y = x >> 14;         // Set y = 0, 1, 2, or 3.
   m = y & ~(y >> 1);   // Set m = 0, 1, 2, or 2 resp.
   return n + 2 - m;
}

If your machine has the population count instruction, a good way to compute the number of leading zeros
function is given in Figure 5-11. The five assignments to x may be reversed, or in fact done in any order. This
is branch-free and takes 11 instructions. Even if population count is not available, this algorithm may be useful.
Using the 21-instruction code for counting 1-bits given in Figure 5-2 on page 66, it executes in 32 branch-free
basic RISC instructions.

Figure 5-11 Number of leading zeros, right-propagate and count 1-bits.

int nlz(unsigned x) {
   int pop(unsigned x);

   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return pop(~x);
}
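
A self-contained version of the right-propagate method might look like the sketch below. It bundles a population count in the style of Figure 5-2 so that it compiles on its own; the name `nlz_smear` and the use of `uint32_t` are assumptions for illustration.

```c
#include <stdint.h>

/* Population count in the style of Figure 5-2 (included for self-containment). */
static int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
}

/* Propagate the leftmost 1-bit to the right, then count the 0-bits that remain. */
int nlz_smear(uint32_t x) {
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return pop(~x);
}
```

On a compiler or machine that offers a population count primitive, `pop` could be replaced by it; that substitution is left to the reader.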


Floating-Point Methods 


The floating-point post-normalization facilities can be used to count leading zeros. It works out quite well with
IEEE-format floating-point numbers. The idea is to convert the given unsigned integer to double-precision
floating-point, extract the exponent, and subtract it from a constant. Figure 5-12 illustrates a complete
procedure for this.

Figure 5-12 Number of leading zeros, using IEEE floating-point.

int nlz(unsigned k) {
   union {
      unsigned asInt[2];
      double asDouble;
   };
   int n;

   asDouble = (double)k + 0.5;
   n = 1054 - (asInt[LE] >> 20);
   return n;
}


The code uses the C++ "anonymous union" to overlay an integer with a double-precision floating-point
quantity. Variable LE must be 1 for execution on a little-endian machine, and 0 for big-endian. The addition of
0.5, or some other small number, is necessary for the method to work when k = 0.


We will not attempt to assess the execution time of this code, because machines differ so much in their floating- 
point capabilities. For example, many machines have their floating-point registers separate from the integer 
registers, and on such machines data transfers through memory may be required to convert an integer to 
floating-point and then move the result to an integer register. 


The code of Figure 5-12 is not valid C or C++ according to the ANSI standard, because it refers to the same
memory locations as two different types. Thus, one cannot be sure it will work on a particular machine and
compiler. It does work with IBM's XLC compiler on AIX, and with the GCC compiler on AIX and on
Windows 2000, at all optimization levels (as of this writing, anyway). If the code is altered to do the overlay
defining with something like

   xx = (double)k + 0.5;
   n = 1054 - (*((unsigned *)&xx + LE) >> 20);

it does not work on these systems with optimization turned on. This code, incidentally, violates a second ANSI
standard, namely, that pointer arithmetic can be performed only on pointers to array elements [Cohen]. The
failure, however, is due to the first violation, involving overlay defining.
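
One way to avoid accessing the same memory through two types is to copy the bytes of the double into an integer with `memcpy`. The sketch below is not the code discussed above. It assumes IEEE 754 double precision, a 64-bit integer type, and that floating-point and integer values share the same byte order; the name `nlz_double` is hypothetical. Because the whole 64-bit pattern is copied, no selector such as LE is needed.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: number of leading zeros via the exponent of an IEEE 754 double. */
int nlz_double(uint32_t k) {
    double d = (double)k + 0.5;        /* adding 0.5 handles k = 0 */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);    /* copy the representation, no aliasing */
    return 1054 - (int)(bits >> 52);   /* biased exponent; the sign bit is 0 */
}
```

Whether this is competitive with the integer methods depends, as the text notes, on how the machine moves data between floating-point and integer registers.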


In spite of the flakiness of this code,[1] three variations are given below.


[1] The flakiness is due to the way C is used. The methods illustrated would be perfectly acceptable if coded in 
machine language, or generated by a compiler, for a particular machine. 


   asDouble = (double)k;
   n = 1054 - (asInt[LE] >> 20);
   n = (n & 31) + (n >> 9);

   k = k & ~(k >> 1);
   asFloat = (float)k + 0.5f;
   n = 158 - (asInt >> 23);

   k = k & ~(k >> 1);
   asFloat = (float)k;
   n = 158 - (asInt >> 23);
   n = (n & 31) + (n >> 6);


In the first variation, the problem with k = 0 is fixed not by a floating-point addition of 0.5, but by integer
arithmetic on the result n (which would be 1054, or 0x41E, if the correction were not done).

The next two variations use single-precision floating-point, with the "anonymous union" changed in an obvious
way. Here there is a new problem: Rounding can throw off the result when the rounding mode is either round to
nearest (almost universally used) or round toward +∞. For round to nearest mode, the rounding problem
occurs for k in the ranges hexadecimal FFFFFF80 to FFFFFFFF, 7FFFFFC0 to 7FFFFFFF, 3FFFFFE0 to
3FFFFFFF, and so on. In rounding, an add of 1 carries all the way to the left, changing the position of the most
significant 1-bit. The correction steps used above clear the bit to the right of the most significant 1-bit, blocking
the carry.


The GNU C/C++ compiler has a unique feature that allows coding any of these schemes as a macro, giving
in-line code for the function references [Stall]. This feature allows statements, including declarations, to be
inserted in code where an expression is called for. The sequence of statements would usually end with an
expression, which is taken to be the value of the construction. Such a macro definition is shown below, for the
first single-precision variation. (In C, it is customary to use uppercase for macro names.)

#define NLZ(k) \
   ({union {unsigned _asInt; float _asFloat;}; \
     unsigned _kk = (k) & ~((unsigned)(k) >> 1); \
     _asFloat = (float)_kk + 0.5f; \
     158 - (_asInt >> 23);})


The underscores are used to avoid name conflicts with parameter k; presumably, user-defined names do not
begin with underscores.


Relation to the Log Function 


The "nlz" function is, essentially, the "integer log base 2" function. For unsigned x ≠ 0,

$$ \lfloor \log_2(x) \rfloor = 31 - \text{nlz}(x), \text{ and} $$

$$ \lceil \log_2(x) \rceil = 32 - \text{nlz}(x - 1). $$

See also Section 11-4, "Integer Logarithm," on page 215.
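
The two identities translate directly into code. The sketch below pairs them with a counting-down binary search for nlz so that it is self-contained; the names `floor_log2` and `ceil_log2` are hypothetical, and both require x ≠ 0.

```c
#include <stdint.h>

/* Number of leading zeros by binary search, counting down (32-bit word). */
static int nlz(uint32_t x) {
    uint32_t y;
    int n = 32;
    y = x >> 16;  if (y != 0) {n = n - 16;  x = y;}
    y = x >>  8;  if (y != 0) {n = n -  8;  x = y;}
    y = x >>  4;  if (y != 0) {n = n -  4;  x = y;}
    y = x >>  2;  if (y != 0) {n = n -  2;  x = y;}
    y = x >>  1;  if (y != 0) return n - 2;
    return n - (int)x;
}

/* floor(log2(x)) for x != 0. */
int floor_log2(uint32_t x) { return 31 - nlz(x); }

/* ceil(log2(x)) for x != 0; note that nlz(0) = 32 makes ceil_log2(1) = 0. */
int ceil_log2(uint32_t x)  { return 32 - nlz(x - 1); }
```

For x = 9, the floor is 3 and the ceiling is 4; for a power of 2 the two agree.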


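Another closely related function is bitsize, the number of bits required to represent its argument as a signed
quantity in two's-complement form. We take its definition to be

$$
\text{bitsize}(x) =
\begin{cases}
1, & x = -1 \text{ or } 0, \\
2, & x = -2 \text{ or } 1, \\
3, & -4 \le x \le -3 \text{ or } 2 \le x \le 3, \\
4, & -8 \le x \le -5 \text{ or } 4 \le x \le 7, \\
\;\vdots \\
32, & -2^{31} \le x \le -2^{30} - 1 \text{ or } 2^{30} \le x \le 2^{31} - 1.
\end{cases}
$$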

From this definition, bitsize(x) = bitsize(- x - 1). But - x - 1 = ~x, so an algorithm for bitsize is (where the shift 
is signed) 


   x = x ^ (x >> 31);     // If (x < 0) x = -x - 1;
   return 33 - nlz(x);
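
A portable sketch of bitsize follows. Because the result of right-shifting a negative signed value is implementation-defined in C, this version replaces the signed shift with an explicit test, which is a substitution made for this illustration rather than the branch-free form shown above. The nlz helper is a counting-down binary search included only for self-containment.

```c
#include <stdint.h>

/* Number of leading zeros by binary search, counting down (32-bit word). */
static int nlz(uint32_t x) {
    uint32_t y;
    int n = 32;
    y = x >> 16;  if (y != 0) {n = n - 16;  x = y;}
    y = x >>  8;  if (y != 0) {n = n -  8;  x = y;}
    y = x >>  4;  if (y != 0) {n = n -  4;  x = y;}
    y = x >>  2;  if (y != 0) {n = n -  2;  x = y;}
    y = x >>  1;  if (y != 0) return n - 2;
    return n - (int)x;
}

/* Bits needed to hold x as a two's-complement integer. */
int bitsize(int32_t x) {
    /* For negative x, use ~x = -x - 1, which has the same bitsize. */
    uint32_t u = (x < 0) ? ~(uint32_t)x : (uint32_t)x;
    return 33 - nlz(u);
}
```

The extremes behave as the definition requires: both the most negative and the most positive 32-bit values need 32 bits.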


Applications 


Two important applications of the number of leading zeros function are in simulating floating-point arithmetic 
operations and in various division algorithms (see Figure 9-1 on page 141, and Figure 9-3 on page 152). The 
instruction seems to have a miscellany of other uses. 


It can be used to get the "x = y" predicate in only three instructions (see "Comparison Predicates" on page 21),
and as an aid in computing certain elementary functions (see pages 205, 208, 214, and 218). 


A novel application is to generate exponentially distributed random integers by generating uniformly 
distributed random integers and taking "nlz" of the result [GLS1]. The result is 0 with probability 1/2, 1 with 
probability 1/4, 2 with probability 1/8, and so on. Another application is as an aid in searching a word for a 
consecutive string of 1-bits (or 0-bits) of a certain length, a process that is used in some disk block allocation 
algorithms. For these last two applications, the number of trailing zeros function could also be used. 


5-4 Counting Trailing 0's 


If the number of leading zeros instruction is available, then the best way to count trailing 0's is, most likely, to
convert it to a count leading 0's problem:

$$ 32 - \text{nlz}(\lnot x \mathbin{\&} (x - 1)). $$

If population count is available, a slightly better method is to form a mask that identifies the trailing 0's, and
count the 1-bits in it [Hay2], such as

$$ \text{pop}(\lnot x \mathbin{\&} (x - 1)), \text{ and} $$

$$ 32 - \text{pop}(x \mid -x). $$


Variations exist using other expressions for forming a mask that identifies the trailing zeros of x, such as those
given in Section 2-1, "Manipulating Rightmost Bits" on page 11. These methods are also reasonable even if the
machine has none of the bit-counting instructions. Using the algorithm for pop(x) given in Figure 5-2 on page
66, the first expression above executes in about 3 + 21 = 24 instructions (branch-free).
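
The first population-count expression can be sketched as follows, again with a Figure 5-2 style pop bundled in for self-containment. The name `ntz_pop` is hypothetical.

```c
#include <stdint.h>

/* Population count in the style of Figure 5-2 (included for self-containment). */
static int pop(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
}

/* ~x & (x - 1) has 1's exactly in the trailing-zero positions of x. */
int ntz_pop(uint32_t x) {
    return pop(~x & (x - 1u));
}
```

For x = 0 the mask is all 1's, so the function returns 32 with no special case.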


Figure 5-13 shows an algorithm that does it directly, in 12 to 20 basic RISC instructions (for x ≠ 0).

Figure 5-13 Number of trailing zeros, binary search.

int ntz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x & 0x0000FFFF) == 0) {n = n +16; x = x >>16;}
   if ((x & 0x000000FF) == 0) {n = n + 8; x = x >> 8;}
   if ((x & 0x0000000F) == 0) {n = n + 4; x = x >> 4;}
   if ((x & 0x00000003) == 0) {n = n + 2; x = x >> 2;}
   return n - (x & 1);
}


The n + 16 can be simplified to 17 if that helps, and if the compiler is not smart enough to do that for you
(this does not affect the number of instructions as we are counting them).


Figure 5-14 shows a variation that uses smaller immediate values and simpler operations. It executes in 12 to 
21 basic RISC instructions. Unlike the above procedure, when the number of trailing 0's is small, the one in 
Figure 5-14 executes a larger number of instructions, but also a larger number of "fall through" branches. 


Figure 5-14 Number of trailing zeros, smaller immediate values.

int ntz(unsigned x) {
   unsigned y;
   int n;

   if (x == 0) return 32;
   n = 31;
   y = x <<16;  if (y != 0) {n = n -16;  x = y;}
   y = x << 8;  if (y != 0) {n = n - 8;  x = y;}
   y = x << 4;  if (y != 0) {n = n - 4;  x = y;}
   y = x << 2;  if (y != 0) {n = n - 2;  x = y;}
   y = x << 1;  if (y != 0) {n = n - 1;}
   return n;
}


The line just above the return statement may alternatively be coded

   n = n - ((x << 1) >> 31);

which saves a branch, but not an instruction.


In terms of number of instructions executed, it is hard to beat the "search tree" [Aus2]. Figure 5-15 illustrates 
this procedure for an 8-bit argument. This procedure executes in seven instructions for all paths except the last 
two (return 7 or 8), which require nine. A 32-bit version would execute in 11 to 13 instructions. Unfortunately, 
for large word sizes the program is quite large. The 8-bit version above is 12 lines of executable source code 
and would compile into about 41 instructions. A 32-bit version would be 48 lines and about 164 instructions, 
and a 64-bit version would be twice that. 


Figure 5-15 Number of trailing zeros, binary search tree. 


int ntz(char x) {
   if (x & 15) {
      if (x & 3) {
         if (x & 1) return 0;
         else return 1;
      }
      else if (x & 4) return 2;
      else return 3;
   }
   else if (x & 0x30) {
      if (x & 0x10) return 4;
      else return 5;
   }
   else if (x & 0x40) return 6;
   else if (x) return 7;
   else return 8;
}


If the number of trailing 0's is expected to be small or large, then the simple loops below are quite fast. The
left-hand algorithm executes in 5 + 3ntz(x) and the right one in 3 + 3(32 - ntz(x)) basic RISC instructions.

Figure 5-16 Number of trailing zeros, simple counting loops.

int ntz(unsigned x) {
   int n;

   x = ~x & (x - 1);
   n = 0;                       // n = 32;
   while(x != 0) {              // while (x != 0) {
      n = n + 1;                //    n = n - 1;
      x = x >> 1;               //    x = x + x;
   }                            // }
   return n;                    // return n;
}


It is interesting to note that if the numbers x are uniformly distributed, then the average number of trailing 0's is,
very nearly, 1.0. To see this, sum the products $p_i n_i$, where $p_i$ is the probability that there are exactly $n_i$
trailing 0's. That is,

$$S = \frac{1}{2}\cdot 0 + \frac{1}{4}\cdot 1 + \frac{1}{8}\cdot 2 + \frac{1}{16}\cdot 3 + \frac{1}{32}\cdot 4 + \cdots = \sum_{n=0}^{\infty} \frac{n}{2^{n+1}}.$$

To evaluate this sum, consider the following array:

   1/4   1/8   1/16   1/32   1/64   ...
         1/8   1/16   1/32   1/64   ...
               1/16   1/32   1/64   ...
                      1/32   1/64   ...
                                    ...

The sum of each column is a term of the series for S. Hence S is the sum of all the numbers in the array. The
sums of the rows are

   1/4  + 1/8  + 1/16 + 1/32  + ... = 1/2
   1/8  + 1/16 + 1/32 + 1/64  + ... = 1/4
   1/16 + 1/32 + 1/64 + 1/128 + ... = 1/8
   ...

and the sum of these is 1/2 + 1/4 + 1/8 + ... = 1. The absolute convergence of the original series justifies the
rearrangement.
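
The value of S can also be checked by brute force. The following C sketch is not from the text: it assumes a 16-bit word to keep the loop short, takes the count for a zero word to be 16, and uses the hypothetical helper names ntz16 and total_ntz16.

```c
#include <stdint.h>

/* Trailing-zero count for a 16-bit word; the count for 0 is taken as 16. */
static int ntz16(uint16_t x) {
    int n = 0;
    if (x == 0) return 16;
    while ((x & 1u) == 0) {
        x = (uint16_t)(x >> 1);
        n++;
    }
    return n;
}

/* Sum of ntz16(x) over all 65,536 possible 16-bit words. */
unsigned long total_ntz16(void) {
    unsigned long sum = 0;
    unsigned long x;
    for (x = 0; x <= 0xFFFFul; x++)
        sum += (unsigned long)ntz16((uint16_t)x);
    return sum;
}
```

Dividing the total by 65,536 gives the average. For 16-bit words the total is 65,535, so the mean falls short of 1.0 by only 2^-16.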


Sometimes, a function similar to ntz(x) is wanted, but a 0 argument is a special case, perhaps an error, that
should be identified with a value of the function that's easily distinguished from the "normal" values of the
function. For example, let us define "the number of factors of 2 in x" to be

$$\mathrm{fact2}(x) = \begin{cases} \mathrm{ntz}(x), & x \neq 0, \\ -1, & x = 0. \end{cases}$$

This can be calculated from

$$31 - \mathrm{nlz}(x \mathbin{\&} -x).$$
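
As a minimal sketch of this formula in C (not from the text): the loop-based nlz below is a stand-in for a hardware number of leading zeros instruction, with nlz(0) = 32, and 32-bit words are assumed.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; nlz(0) = 32.
   A simple loop standing in for a hardware instruction. */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Number of factors of 2 in x, or -1 if x is 0.
   x & -x isolates the rightmost 1-bit (0u - x is the two's-complement negation). */
int fact2(uint32_t x) {
    return 31 - nlz(x & (0u - x));
}
```

For x = 0 the isolated bit is 0, nlz returns 32, and the result is 31 - 32 = -1, as required.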


Applications 


[GLS1] points out some interesting applications of the number of trailing zeros function. It has been named the
"ruler function" by Eric Jensen, because it gives the height of a tick mark on a ruler that's divided into halves,
quarters, eighths, and so on.


It has an application in R. W. Gosper's loop-detection algorithm, which will now be described in some detail 
because it is quite elegant and it does more than might at first seem possible. 


Suppose a sequence $X_0, X_1, X_2, \ldots$ is defined by $X_{n+1} = f(X_n)$. If the range of f is finite, the sequence is
necessarily periodic. That is, it consists of a leader $X_0, X_1, \ldots, X_{\mu-1}$ followed by a cycle
$X_\mu, X_{\mu+1}, \ldots, X_{\mu+\lambda-1}$ that repeats without limit ($X_\mu = X_{\mu+\lambda}$,
$X_{\mu+1} = X_{\mu+\lambda+1}$, and so on, where λ is the period of the cycle). Given
the function f, the loop-detection problem is to find the index μ of the first element that repeats, and the period
λ. Loop detection has applications in testing random number generators and detecting a cycle in a linked list.


One could save all the values of the sequence as they are produced, and compare each new element with all the 
preceding ones. This would immediately show where the second cycle starts. But algorithms exist that are 
much more efficient in space and time. 


Perhaps the simplest is due to R. W. Floyd [Knu2, sec. 3.1, prob. 6]. This algorithm iterates the process

   x ← f(x)
   y ← f(f(y))

with x and y initialized to $X_0$. After the nth step, $x = X_n$ and $y = X_{2n}$. These are compared, and if equal, it is
known that $X_n$ and $X_{2n}$ are separated by an integral multiple of the period λ; that is, 2n - n = n is a multiple of
λ. Then μ can be determined by regenerating the sequence from the beginning, comparing $X_0$ to $X_n$, then $X_1$ to
$X_{n+1}$, and so on. Equality occurs when $X_\mu$ is compared to $X_{n+\mu}$. Finally, λ can be determined by regenerating
more elements, comparing $X_\mu$ to $X_{\mu+1}, X_{\mu+2}, \ldots$. This algorithm requires only a small and bounded amount of
space, but it evaluates f many times.
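
The three phases just described can be sketched in C as follows. This is an illustrative sketch, not code from the text; the names ld_floyd and demo_step are hypothetical, and demo_step is a toy function chosen only because its leader and cycle are easy to see.

```c
/* Floyd's loop detection: returns mu (index of the first element that
   repeats) and lambda (the period) for the sequence X0, f(X0), f(f(X0)), ... */
void ld_floyd(unsigned (*f)(unsigned), unsigned x0,
              unsigned *mu, unsigned *lambda) {
    unsigned x = f(x0);        /* x = X_1 */
    unsigned y = f(f(x0));     /* y = X_2 */
    unsigned m = 0, l = 1;

    /* Phase 1: advance until X_n == X_2n, so n is a multiple of lambda. */
    while (x != y) {
        x = f(x);
        y = f(f(y));
    }
    /* Phase 2: compare X_0 to X_n, X_1 to X_(n+1), ...; they meet at mu. */
    x = x0;
    while (x != y) {
        x = f(x);
        y = f(y);
        m++;
    }
    /* Phase 3: walk around the cycle once, starting from X_mu. */
    y = f(x);
    while (x != y) {
        y = f(y);
        l++;
    }
    *mu = m;
    *lambda = l;
}

/* Toy sequence from 0: 0, 1, 2, 3, 4, 5, 6, 3, 4, 5, 6, ...
   (leader of length 3, cycle of length 4). */
unsigned demo_step(unsigned x) {
    return x < 6 ? x + 1 : 3;
}
```

Calling ld_floyd(demo_step, 0, &mu, &lambda) yields mu = 3 and lambda = 4 for this toy sequence. Note how f is reevaluated in all three phases, which is the cost Gosper's algorithm avoids.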


Gosper's algorithm [HAK, item 132; Knu2, Answers to Exercises for sec. 3.1, prob. 7] finds the period λ, but
not the starting point μ of the first cycle. Its main feature is that it never backs up to reevaluate f, and it is quite
economical in space and time. It is not bounded in space; it requires a table of size $\log_2(\Lambda) + 1$, where Λ is the
largest possible period. This is not a lot of space; for example, if it is known a priori that $\Lambda \le 2^{32}$, then 33
words suffice.


Gosper's algorithm, coded in C, is shown in Figure 5-17. This C function is given the function f being analyzed
and a starting value $X_0$. It returns lower and upper bounds on μ, and the period λ. (Although Gosper's algorithm
cannot compute μ, it can compute lower and upper bounds $\mu_l$ and $\mu_u$ such that
$\mu_u - \mu_l + 1 \le \max(\lambda - 1, 1)$.) The
algorithm works by comparing $X_n$, for n = 1, 2, ..., to a subset of size $\lfloor \log_2 n \rfloor + 1$ of the elements of the
sequence that precede $X_n$. The elements of the subset are the closest preceding $X_i$ such that i + 1 ends in a 1-bit
(that is, i is the even number preceding n), the closest preceding $X_i$ such that i + 1 ends in exactly one 0-bit, the
closest preceding $X_i$ such that i + 1 ends in exactly two 0-bits, and so on.


Figure 5-17 Gosper's loop-detection algorithm. 


void ld_Gosper(int (*f)(int), int X0, int *mu_l,
                              int *mu_u, int *lambda) {
   int Xn, k, m, kmax, n, lgl;
   int T[33];

   T[0] = X0;
   Xn = X0;
   for (n = 1; ; n++) {
      Xn = f(Xn);
      kmax = 31 - nlz(n);           // Floor(log2 n).
      for (k = 0; k <= kmax; k++) {
         if (Xn == T[k]) goto L;
      }
      T[ntz(n+1)] = Xn;             // No match.
   }
L:
   // Compute m = max{i | i < n and ntz(i+1) = k}.

   m = ((((n >> k) - 1) | 1) << k) - 1;
   *lambda = n - m;
   lgl = 31 - nlz(*lambda - 1);     // Ceil(log2 lambda) - 1.
   *mu_u = m;                       // Upper bound on mu.
   *mu_l = m - max(1, 1 << lgl) + 1;// Lower bound on mu.
}


Thus, the comparisons proceed as follows:

   X1:  X0              X7:  X6,  X5, X3          X13: X12, X9,  X11, X7
   X2:  X0, X1          X8:  X6,  X5, X3,  X7     X14: X12, X13, X11, X7
   X3:  X2, X1          X9:  X8,  X5, X3,  X7     X15: X14, X13, X11, X7
   X4:  X2, X1, X3      X10: X8,  X9, X3,  X7     X16: X14, X13, X11, X7, X15
   X5:  X4, X1, X3      X11: X10, X9, X3,  X7     X17: X16, X13, X11, X7, X15
   X6:  X4, X5, X3      X12: X10, X9, X11, X7     X18: X16, X17, X11, X7, X15


It can be shown that the algorithm always terminates with n somewhere in the second cycle, that is, with
n < μ + 2λ. See [Knu2] for further details.


The ruler function reveals how to solve the Tower of Hanoi puzzle. Number the n disks from 0 to n - 1. At each
move k, as k goes from 1 to $2^n - 1$, move disk ntz(k) the minimum permitted distance to the right, in a circular
manner.


The ruler function can be used to generate a reflected binary Gray code (see Section 13-1 on page 235). Start
with an arbitrary n-bit word, and at each step k, as k goes from 1 to $2^n - 1$, flip bit ntz(k).
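
The Gray code construction can be sketched in C as below. This is an illustration rather than code from the text: the function name ruler_gray is hypothetical, the starting word is assumed to be 0, and ntz is a plain loop rather than one of the faster procedures of this chapter.

```c
#include <stdint.h>

/* Number of trailing zeros of a 32-bit word; ntz(0) = 32. */
static int ntz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Start from the word 0 and, at each step k = 1, 2, ..., steps,
   flip bit ntz(k). Returns the word reached after the last step. */
uint32_t ruler_gray(uint32_t steps) {
    uint32_t g = 0;
    uint32_t k;
    for (k = 1; k <= steps; k++)
        g ^= UINT32_C(1) << ntz(k);
    return g;
}
```

Starting from 0, the word after step k equals k ^ (k >> 1), the usual closed form for the reflected binary Gray code; successive words are 1, 3, 2, 6, 7, 5, 4, and so on.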


Chapter 6. Searching Words 


Find First 0-Byte 


Find First String of 1-Bits of a Given Length 


6-1 Find First 0-Byte 


The need for this function stems mainly from the way character strings are represented in the C language. They
have no explicit length stored with them; instead, the end of the string is denoted by an all-0 byte. To find the
length of a string, a C program uses the "strlen" (string length) function. This function searches the string, from
left to right, for the 0-byte, and returns the number of bytes scanned, not counting the 0-byte.


A fast implementation of "strlen" might load and test single bytes until a word boundary is reached, and then 
load a word at a time into a register, and test the register for the presence of a 0-byte. On big-endian machines, 
we want a function that returns the index of the first 0-byte from the left. A convenient encoding is values from 
0 to 3 denoting bytes 0 to 3, and a value of 4 denoting that there is no 0-byte in the word. This is the value to 
add to the string length, as successive words are searched, if the string length is initialized to 0. On little-endian 
machines, one wants the index of the first 0-byte from the right end of the register, because little-endian 
machines reverse the four bytes when a word is loaded into a register. Specifically, we are interested in the
following functions, where "00" denotes a 0-byte, "nn" denotes a nonzero byte, and "xx" denotes a byte that
may be 0 or nonzero.

$$\mathrm{zbytel}(x) = \begin{cases}
0, & x = \texttt{00xxxxxx}, \\
1, & x = \texttt{nn00xxxx}, \\
2, & x = \texttt{nnnn00xx}, \\
3, & x = \texttt{nnnnnn00}, \\
4, & x = \texttt{nnnnnnnn}.
\end{cases}
\qquad
\mathrm{zbyter}(x) = \begin{cases}
0, & x = \texttt{xxxxxx00}, \\
1, & x = \texttt{xxxx00nn}, \\
2, & x = \texttt{xx00nnnn}, \\
3, & x = \texttt{00nnnnnn}, \\
4, & x = \texttt{nnnnnnnn}.
\end{cases}$$

Our first procedure for the find leftmost 0-byte function, shown in Figure 6-1, simply tests each byte, in left-to-
right order, and returns the result when the first 0-byte is found.


Figure 6-1 Find leftmost 0-byte, simple sequence of tests.


int zbytel(unsigned x) {
   if      ((x >> 24) == 0)        return 0;
   else if ((x & 0x00FF0000) == 0) return 1;
   else if ((x & 0x0000FF00) == 0) return 2;
   else if ((x & 0x000000FF) == 0) return 3;
   else return 4;
}


This executes in two to 11 basic RISC instructions, 11 in the case that the word has no 0-bytes (which is the
important case for the "strlen" function). A very similar program will handle the problem of finding the
rightmost 0-byte.


Figure 6-2 shows a branch-free procedure for this function. The idea is to convert each 0-byte to 0x80, and each 
nonzero byte to 0x00, and then use number of leading zeros. This procedure executes in eight instructions if the 
machine has the number of leading zeros and nor instructions. Some similar tricks are described in [Lamp]. 


Figure 6-2 Find leftmost 0-byte, branch-free code. 


int zbytel(unsigned x) {
   unsigned y;
   int n;
                                        // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;   // 7F 7F 1xxxxxxx
   y = ~(y | x | 0x7F7F7F7F);           // 80 00 00000000
   n = nlz(y) >> 3;                     // n = 0 ... 4, 4 if x
   return n;                            // has no 0-byte.
}


The position of the rightmost 0-byte is given by the number of trailing 0's in the final value of y computed
above, divided by 8 (with fraction discarded). Using the expression for computing the number of trailing 0's by
means of the number of leading zeros instruction (see Section 5-4, "Counting Trailing 0's," on page 84), this
can be computed by replacing the assignment to n in the procedure above with:

   n = (32 - nlz(~y & (y - 1))) >> 3;

This is a 12-instruction solution, if the machine has nor and and not.


In most situations on PowerPC, incidentally, a procedure to find the rightmost 0-byte would not be needed.
Instead, the words can be loaded with the load word byte-reverse instruction (lwbrx).


The procedure of Figure 6-2 is more valuable on a 64-bit machine than on a 32-bit one, because on a 64-bit 
machine the procedure (with obvious modifications) requires about the same number of instructions (seven or 
ten, depending upon how the constant is generated), whereas the technique of Figure 6-1 requires 23 
instructions worst case. 


If only a test for the presence of a 0-byte is wanted, then a branch on zero (or nonzero) can be inserted just after
the second assignment to y.


If the "nlz" instruction is not available, there does not seem to be any really good way to compute the find first 
0-byte function. Figure 6-3 shows a possibility (only the executable part of the code is shown). 


Figure 6-3 Find leftmost 0-byte, not using nlz.


                                        // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;   // 7F 7F 1xxxxxxx
   y = ~(y | x | 0x7F7F7F7F);           // 80 00 00000000
                                        // These steps map:
   if (y == 0) return 4;                // 00000000 ==> 4,
   else if (y > 0x0000FFFF)             // 80xxxxxx ==> 0,
      return (y >> 31) ^ 1;             // 0080xxxx ==> 1,
   else                                 // 000080xx ==> 2,
      return (y >> 15) ^ 3;             // 00000080 ==> 3.


This executes in ten to 13 basic RISC instructions, ten in the all-nonzero case. Thus, it is probably not as good 
as the code of Figure 6-1, although it does have fewer branch instructions. It does not scale very well to 64-bit 
machines, unfortunately. 


There are other possibilities for avoiding the "nlz" function. The value of y computed by the code of Figure 6-3
consists of four bytes, each of which is either 0x00 or 0x80. The remainder after dividing such a number by
0x7F is the original value with the up-to-four 1-bits moved and compressed to the four rightmost positions.
Thus, the remainder ranges from 0 to 15 and uniquely identifies the original number. For example,

   remu(0x80808080, 127) = 15,
   remu(0x80000000, 127) = 8,
   remu(0x00008080, 127) = 3, etc.

This value can be used to index a table, 16 bytes in size, to get the desired result.
Thus, the code beginning if (y == 0) can be replaced with

   static char table[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                            0, 0, 0, 0, 0, 0, 0, 0};
   return table[y%127];

where y is unsigned. The number 31 can be used in place of 127, but with a different table.


These methods involving dividing by 127 or 31 are really just curiosities, because the remainder function is apt 
to require 20 cycles or more even if directly implemented in hardware. However, below are two more efficient 
replacements for the code in Figure 6-3 beginning with if (y == 0): 


return table[hopu(y, 0x02040810) & 15]; 
return table[y*0x00204081 >> 28]; 


Here, hopu(a, b) denotes the high-order 32 bits of the unsigned product of a and b. In the second line, we
assume the usual HLL convention that the value of the multiplication is the low-order 32 bits of the complete
product. This might be a practical method, if either the machine has a fast multiply or the multiplication by
0x204081 is done by shift-and-add's. It can be done in four such instructions, as suggested by

$$y(1 + 2^7 + 2^{14} + 2^{21}) = y(1 + 2^7)(1 + 2^{14}).$$

Using this 4-cycle way to do the multiplication, the total time for the procedure comes to 13 cycles (7 to
compute y, plus 4 for the shift-and-add's, plus 2 for the shift right of 28 and the table index), and of course it is
branch-free.
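
The remainder and multiplicative lookups can be compared side by side with a small C sketch. It is illustrative only: the names zero_byte_mask, zbytel_rem, and zbytel_mul are hypothetical, the 16-entry table is the one given above, and uint32_t is assumed so that the product wraps modulo 2^32.

```c
#include <stdint.h>

/* table[i] for the 4-bit code i built from the bits of y, as in the text. */
static const unsigned char zb_table[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                                           0, 0, 0, 0, 0, 0, 0, 0};

/* Returns a word with 0x80 in each byte position where x has a 0-byte,
   and 0x00 elsewhere (the value y of Figures 6-2 and 6-3). */
static uint32_t zero_byte_mask(uint32_t x) {
    uint32_t y = (x & UINT32_C(0x7F7F7F7F)) + UINT32_C(0x7F7F7F7F);
    return ~(y | x | UINT32_C(0x7F7F7F7F));
}

/* Find leftmost 0-byte using the remainder modulo 127. */
int zbytel_rem(uint32_t x) {
    return zb_table[zero_byte_mask(x) % 127u];
}

/* Find leftmost 0-byte using the low-order 32 bits of y * 0x00204081. */
int zbytel_mul(uint32_t x) {
    uint32_t y = zero_byte_mask(x);
    return zb_table[(uint32_t)(y * UINT32_C(0x00204081)) >> 28];
}
```

Both functions return 0 to 3 for the index of the leftmost 0-byte and 4 when the word has none; the two variants differ only in how the four flag bits are gathered into a table index.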


These scale reasonably well to a 64-bit machine. For the "modulus" method, use 


return table[y%511]; 


where table is of size 256, with values 8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, ... (i.e., table[i] =
number of trailing 0's in i).


For the multiplicative methods, use either

   return table[hopu(y, 0x0204081020408100) & 255];   or
   return table[(y*0x0002040810204081) >> 56];

where table is of size 256, with values 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 3, ....
The multiplication by 0x0002040810204081 can be done with

$$t_1 \leftarrow y(1 + 2^7), \qquad t_2 \leftarrow t_1(1 + 2^{14}), \qquad t_3 \leftarrow t_2(1 + 2^{28}),$$

which gives a 13-cycle solution.


All these variations using the table can, of course, implement the find rightmost 0-byte function by simply 
changing the data in the table. 


If the machine does not have the nor instruction, the not in the second assignment to y in Figure 6-3 can be
omitted, in the case of a 32-bit machine, by using one of the three return statements given above, with
table[i] = 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4. This scheme does not quite work on a 64-bit machine.


Here is an interesting variation on the procedure of Figure 6-2, again aimed at machines that do not have
number of leading zeros. Let a, b, c, and d be 1-bit variables for the predicates "the first byte of x is nonzero,"
"the second byte of x is nonzero," and so on. Then,

   zbytel(x) = a + ab + abc + abcd.


The multiplications can be done with and's, leading to the procedure shown in Figure 6-4 (only the executable 
code is shown). 


Figure 6-4 Find leftmost 0-byte by evaluating a polynomial. 


   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;
   y = y | x;               // Leading 1 on nonzero bytes.
   t1 = y >> 31;            // t1 = a.
   t2 = (y >> 23) & t1;     // t2 = ab.
   t3 = (y >> 15) & t2;     // t3 = abc.
   t4 = (y >> 7) & t3;      // t4 = abcd.
   return t1 + t2 + t3 + t4;


This comes to 15 instructions on the basic RISC, which is not particularly fast, but there is a certain amount of 
parallelism. On a superscalar machine that can execute up to three arithmetic instructions in parallel, provided 
they are independent, it comes to only ten cycles. 


A simple variation of this does the find rightmost 0-byte function, based on 


   zbyter(x) = abcd + bcd + cd + d.


(This requires one more and than the code of Figure 6-4.) 


Some Simple Generalizations 


Functions "zbytel" and "zbyter" can be used to search for a byte equal to any particular value, by first exclusive
or'ing the argument x with a word consisting of the desired value replicated in each byte position. For example,
to search x for an ASCII blank (0x20), search x ⊕ 0x20202020 for a 0-byte.


Similarly, to search for a byte position in which two words x and y are equal, search x ⊕ y for a 0-byte.


There is nothing special about byte boundaries in the code of Figure 6-2 and its variants. For example, to search
a word for a 0-value in any of the first four bits, the next 12, or the last 16, use the code of Figure 6-2 with the
mask replaced by 0x77FF7FFF [PHO]. (If a field length is 1, use a 0 in the mask at that position.)
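
A short C sketch of the first generalization follows. It is not from the text: find_byte is a hypothetical name, zbytel is the branch-free procedure of Figure 6-2 restated for uint32_t, and nlz is a simple loop standing in for the hardware instruction.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; nlz(0) = 32. */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Find leftmost 0-byte, branch-free (after Figure 6-2). */
static int zbytel(uint32_t x) {
    uint32_t y = (x & UINT32_C(0x7F7F7F7F)) + UINT32_C(0x7F7F7F7F);
    y = ~(y | x | UINT32_C(0x7F7F7F7F));
    return nlz(y) >> 3;
}

/* Index (0 to 3, from the left) of the first byte of x equal to c,
   or 4 if no byte matches. Multiplying by 0x01010101 replicates c
   into all four byte positions. */
int find_byte(uint32_t x, unsigned char c) {
    return zbytel(x ^ (UINT32_C(0x01010101) * c));
}
```

For instance, find_byte(0x41422043, 0x20) locates the blank in the third byte and returns 2.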


Searching for a Value in a Given Range 


The code of Figure 6-2 can easily be modified to search for a byte in the range 0 to any specified value less than 
128. To illustrate, the following code finds the index of the leftmost byte having value from 0 to 9: 


   y = (x & 0x7F7F7F7F) + 0x76767676;
   y = y | x;
   y = y | 0x7F7F7F7F;      // Bytes > 9 are 0xFF.
   y = ~y;                  // Bytes > 9 are 0x00,
                            // bytes <= 9 are 0x80.
   n = nlz(y) >> 3;


More generally, suppose you want to find the leftmost byte in a word that is in the range a to b, where the
difference between a and b is less than 128. For example, the uppercase letters encoded in ASCII range from
0x41 to 0x5A. To find the first uppercase letter in a word, subtract 0x41414141 in such a way that the borrow
does not propagate across byte boundaries, and then use the above code to identify bytes having value from 0 to
0x19 (0x5A - 0x41). Using the formulas for subtraction given in Section 2-17, "Multibyte Add, Subtract,
Absolute Value," on page 36, with obvious simplifications possible with y = 0x41414141, gives

   d = (x | 0x80808080) - 0x41414141;
   d = ~((x | 0x7F7F7F7F) ^ d);
   y = (d & 0x7F7F7F7F) + 0x66666666;
   y = y | d;
   y = y | 0x7F7F7F7F;      // Bytes not from 41-5A are FF.
   y = ~y;                  // Bytes not from 41-5A are 00,
                            // bytes from 41-5A are 80.
   n = nlz(y) >> 3;


For some ranges of values, simpler code exists. For example, to find the first byte whose value is 0x30 to 0x39
(a decimal digit encoded in ASCII), simply exclusive or the input word with 0x30303030 and then use the code
given above to search for a value in the range 0 to 9. (This simplification is applicable when the upper and
lower limits have n high-order bits in common, and the lower limit ends with 8 - n 0's.)


These techniques can be adapted to handle ranges of 128 or larger with no additional instructions. For example,
to find the index of the leftmost byte whose value is in the range 0 to 137 (0x89), simply change the line
y = y | x to y = y & x in the code above for searching for a value from 0 to 9.


Similarly, changing the line y = y | d to y = y & d in the code for finding the leftmost byte whose value is in
the range 0x41 to 0x5A causes it to find the leftmost byte whose value is in the range 0x41 to 0xDA.
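
The ASCII-digit case mentioned above can be written out as a C sketch. It is an illustration, not code from the text: first_digit is a hypothetical name, and nlz is a simple loop standing in for the hardware instruction.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; nlz(0) = 32. */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Index (0 to 3, from the left) of the first byte of x that is an
   ASCII decimal digit (0x30 to 0x39), or 4 if there is none. */
int first_digit(uint32_t x) {
    uint32_t y;
    x ^= UINT32_C(0x30303030);                 /* Digits become 0..9.      */
    y = (x & UINT32_C(0x7F7F7F7F)) + UINT32_C(0x76767676);
    y = y | x;
    y = y | UINT32_C(0x7F7F7F7F);              /* Bytes > 9 are 0xFF.      */
    y = ~y;                                    /* Bytes <= 9 are 0x80.     */
    return nlz(y) >> 3;
}
```

The exclusive or works here because 0x30 and 0x39 share their four high-order bits and 0x30 ends in four 0's, which is the condition stated in the parenthetical remark above.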


6-2 Find First String of 1-Bits of a Given Length 


The problem here is to search a word in a register for the first string of 1-bits of a given length n or longer, and 
to return its position, with some special indication if no such string exists. Variants are to return only the yes/no 
indication, and to locate the first string of exactly n 1-bits. This problem has application in disk-allocation 
programs, particularly for disk compaction (rearranging data on a disk so that all blocks used to store a file are 
contiguous). The problem was suggested to me by Albert Chang, who pointed out that it is one of the uses for 
the number of leading zeros instruction. 


We assume here that the number of leading zeros instruction, or a suitable subroutine for that function, is 
available. 


An algorithm that immediately comes to mind is to first count the number of leading 0's and skip over them by 
shifting left by the number obtained. Then count the leading 1's by inverting and counting leading 0's. If this is 
of sufficient length, we are done. Otherwise, shift left by the number obtained and repeat from the beginning. 
This algorithm might be coded as shown below. If n consecutive 1-bits are found, it returns a number from 0 to 
31, giving the position of the leftmost 1-bit in the leftmost such sequence. Otherwise, it returns 32 as a "not 
found" indication. 


int ffstr1(unsigned x, int n) {
   int k, p;

   p = 0;                  // Initialize position to return.
   while (x != 0) {
      k = nlz(x);          // Skip over initial 0's
      x = x << k;          // (if any).
      p = p + k;
      k = nlz(~x);         // Count first/next group of 1's.
      if (k >= n)          // If enough,
         return p;         // return.
      x = x << k;          // Not enough 1's, skip over
      p = p + k;           // them.
   }
   return 32;
}


This algorithm is reasonable if it is expected that the loop will not be executed very many times, for example,
if it is expected that x will have long sequences of 1's and of 0's. This might very well be the expectation in the
disk-allocation application. Its worst-case execution time, however, is not very good; for example, about 178
full RISC instructions executed for x = 0x55555555 and n ≥ 2.


An algorithm that is better in worst-case execution time is based on a sequence of shift left and and instructions.
To see how this works, consider searching for a string of eight or more consecutive 1-bits in a 32-bit word x.
This might be done as follows:

$$\begin{aligned}
x &\leftarrow x \mathbin{\&} (x \ll 1) \\
x &\leftarrow x \mathbin{\&} (x \ll 2) \\
x &\leftarrow x \mathbin{\&} (x \ll 4)
\end{aligned}$$

After the first assignment, the 1's in x indicate the starting positions of strings of length 2. After the second
assignment, the 1's in x indicate the starting positions of strings of length 4 (a string of length 2 followed by
another string of length 2). After the third assignment, the 1's in x indicate the starting positions of strings of
length 8. Executing number of leading zeros on this word gives the position of the first string of length 8 (or
more), or 32 if none exists.
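
The length-8 case can be sketched in C as follows. This is an illustration, not code from the text: first_run8 is a hypothetical name, and nlz is a simple loop standing in for the hardware instruction.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; nlz(0) = 32. */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Position (0 = leftmost bit) of the first string of eight or more
   consecutive 1-bits in x, or 32 if there is none. */
int first_run8(uint32_t x) {
    x &= x << 1;    /* 1's now mark starts of strings of length >= 2. */
    x &= x << 2;    /* ... of length >= 4.                            */
    x &= x << 4;    /* ... of length >= 8.                            */
    return nlz(x);
}
```

For x = 0x00FF0000 the three steps leave only bit 23 set, so the function returns 8, the position of the leftmost bit of the run.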


To develop an algorithm that works for any length n from 1 to 32, we will look at this a little differently. First,
observe that the above three assignments may be done in any order. Reverse order will be more convenient. To
illustrate the general method, consider the case n = 10:

$$\begin{aligned}
x_1 &\leftarrow x \mathbin{\&} (x \ll 5) \\
x_2 &\leftarrow x_1 \mathbin{\&} (x_1 \ll 2) \\
x_3 &\leftarrow x_2 \mathbin{\&} (x_2 \ll 1) \\
x_4 &\leftarrow x_3 \mathbin{\&} (x_3 \ll 1)
\end{aligned}$$

The first statement shifts by n/2. After executing it, the problem is reduced to finding a string of five
consecutive 1-bits in $x_1$. This may be done by shifting left by $\lfloor 5/2 \rfloor = 2$, and'ing, and searching the result
for a string of length 3 (5 - 2). The last two statements identify where the strings of length 3 are in $x_2$. The sum
of the shift amounts is always n - 1. The algorithm is shown in Figure 6-5. The execution time ranges from 3 to
36 full RISC instructions, as n ranges from 1 to 32.


If n is often moderately large, it is not unreasonable to unroll this loop by repeating the loop body five times
and omitting the test n > 1. (Five is always sufficient for a 32-bit machine.) This gives a branch-free algorithm
that runs in a constant time of 20 instructions executed (the last assignment to n can be omitted). Although for
small values of n the three assignments are executed more than necessary, the result is unchanged by the extra
steps because variable n sticks at the value 1, and for this value the three steps have no effect on x or n. The
unrolled version is faster than the looping version for n ≥ 5, in terms of number of instructions executed.


Figure 6-5 Find first string of n 1's, shift-and-and sequence. 


int ffstr1(unsigned x, int n) {
   int s;

   while (n > 1) {
      s = n >> 1;
      x = x & (x << s);
      n = n - s;
   }
   return nlz(x);
}


A string of exactly n 1-bits can be found in six more instructions (four if and not is available). The quantity x
computed by the algorithm of Figure 6-5 has 1-bits wherever a string of length n or more 1-bits begins. Hence,
using the final value of x computed by that algorithm, the expression

$$x \mathbin{\&} \neg(x \overset{u}{\gg} 1) \mathbin{\&} \neg(x \ll 1)$$

(in which the right shift is unsigned) contains a 1-bit wherever the final x contains an isolated 1-bit, which is to
say wherever the original x began a string of exactly n 1-bits.
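
Combining the loop of Figure 6-5 with the isolation step gives the following C sketch. It is illustrative rather than taken from the text: ffstr_exact is a hypothetical name, n is assumed to lie in the range 1 to 32, and nlz is a simple loop standing in for the hardware instruction.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; nlz(0) = 32. */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Position (0 = leftmost bit) of the first string of exactly n 1-bits
   in x, or 32 if there is none. Assumes 1 <= n <= 32. */
int ffstr_exact(uint32_t x, int n) {
    int s;
    while (n > 1) {               /* Shift-and-and loop of Figure 6-5.   */
        s = n >> 1;
        x = x & (x << s);
        n = n - s;
    }
    /* Keep only isolated 1-bits: a string longer than n leaves a run
       of adjacent start marks, which this step removes. */
    x = x & ~(x >> 1) & ~(x << 1);
    return nlz(x);
}
```

With x = 0x0F0000FF, a search for exactly four 1-bits returns 4 and ignores the 8-bit string at the right, while a search for exactly three returns 32.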


The algorithm is also easily adapted to finding strings of length n that begin at certain locations. For example, 
to find strings that begin at byte boundaries, simply and the final x with 0x80808080. 


It can be used to find strings of 0-bits either by complementing x at the start, or by changing the and's to or's 
and complementing x just before invoking "nlz." For example, below is an algorithm for finding the first 
(leftmost) 0-byte (see Section 6-1, "Find First 0-Byte," on page 91, for a precise definition of this problem). 


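$$\begin{aligned}
x &\leftarrow x \mid (x \ll 4) \\
x &\leftarrow x \mid (x \ll 2) \\
x &\leftarrow x \mid (x \ll 1) \\
x &\leftarrow \text{0x7F7F7F7F} \mid x \\
p &\leftarrow \mathrm{nlz}(\neg x) \overset{u}{\gg} 3
\end{aligned}$$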


This executes in 12 instructions on the full RISC (not as good as the algorithm of Figure 6-2 on page 92, which 
executes in eight instructions). 


Chapter 7. Rearranging Bits and Bytes 


Reversing Bits and Bytes 
Shuffling Bits 
Transposing a Bit Matrix 


Compress, or Generalized Extract 


General Permutations, Sheep and Goats Operation 


Rearrangements and Index Transformations 


7-1 Reversing Bits and Bytes 
By "reversing bits" we mean to reflect the contents of a register about the middle so that, for example, 


   rev(0x01234567) = 0xE6A2C480.


By "reversing bytes" we mean a similar reflection of the four bytes of a register. Byte reversal is a necessary 
operation to convert data between the "little-endian" format used by DEC and Intel and the "big-endian" format 
used by most other manufacturers. 


Bit reversal can be done quite efficiently by interchanging adjacent single bits, then interchanging adjacent 2- 
bit fields, and so on, as shown below [Aus1]. These five assignment statements can be executed in any order. 


   x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
   x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
   x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
   x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
   x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;

A small improvement results on most machines by using fewer distinct large constants and doing the last two 
assignments in a more straightforward way, as is shown in Figure 7-1 (30 basic RISC instructions, branch-free). 


Figure 7-1 Reversing bits. 


unsigned rev(unsigned x) {
   x = (x & 0x55555555) <<  1 | (x >>  1) & 0x55555555;
   x = (x & 0x33333333) <<  2 | (x >>  2) & 0x33333333;
   x = (x & 0x0F0F0F0F) <<  4 | (x >>  4) & 0x0F0F0F0F;
   x = (x << 24) | ((x & 0xFF00) << 8) |
       ((x >> 8) & 0xFF00) | (x >> 24);
   return x;
}


The last assignment to x in this code does byte reversal in nine basic RISC instructions. If the machine has
rotate shifts, however, this can instead be done in seven instructions with

   x = ((x & 0x00FF00FF) rotr 8) | ((x rotl 8) & 0x00FF00FF),

where rotr and rotl denote rotate right and rotate left.


PowerPC can do the byte-reversal operation in only three instructions [Hay1]: a rotate left of 8, which positions 
two of the bytes, followed by two "rlwimi" (rotate left word immediate then mask insert) instructions. 


Generalized Bit Reversal 


[GLS1] suggests that the following sort of generalization of bit reversal, which he calls "flip," is a good 
candidate to consider for a computer's instruction set: 


   if (k &  1) x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
   if (k &  2) x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
   if (k &  4) x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
   if (k &  8) x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
   if (k & 16) x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;


(The last two and operations can be omitted.) For k = 31, this operation reverses the bits in a word. For k = 24,
it reverses the bytes in a word. For k = 7, it reverses the bits in each byte, without changing the positions of the
bytes. For k = 16, it swaps the left and right halfwords of a word, and so on. In general, it moves the bit at
position m to position m ⊕ k. It can be implemented in hardware very similarly to the way a rotate shifter is
usually implemented (five stages of MUX's, with each stage controlled by a bit of the shift amount k).
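
Wrapped as a C function, the generalized reversal looks like this. The sketch is not from the text: the function name flip follows the name attributed to [GLS1] above, uint32_t is assumed for the word, and the final step is written as a plain halfword swap, which is what remains once the redundant and operations are dropped.

```c
#include <stdint.h>

/* Generalized bit reversal: moves the bit at position m to position m ^ k,
   for k in the range 0 to 31. */
uint32_t flip(uint32_t x, unsigned k) {
    if (k &  1) x = ((x & UINT32_C(0x55555555)) <<  1) | ((x & UINT32_C(0xAAAAAAAA)) >>  1);
    if (k &  2) x = ((x & UINT32_C(0x33333333)) <<  2) | ((x & UINT32_C(0xCCCCCCCC)) >>  2);
    if (k &  4) x = ((x & UINT32_C(0x0F0F0F0F)) <<  4) | ((x & UINT32_C(0xF0F0F0F0)) >>  4);
    if (k &  8) x = ((x & UINT32_C(0x00FF00FF)) <<  8) | ((x & UINT32_C(0xFF00FF00)) >>  8);
    if (k & 16) x = (x << 16) | (x >> 16);     /* Masks are redundant here. */
    return x;
}
```

With this definition flip(0x01234567, 31) gives 0xE6A2C480, the full bit reversal shown at the start of this section, and flip(0x01234567, 24) gives the byte-reversed word 0x67452301.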


Bit-Reversing Novelties 


Item 167 in [HAK] contains rather esoteric expressions for reversing 6-, 7-, and 8-bit integers. Although these 
expressions are designed for a 36-bit machine, the one for reversing a 6-bit integer works on a 32-bit machine, 
and those for 7- and 8-bit integers work on a 64-bit machine. These expressions are as follows: 


   6-bit:   remu((x * 0x00082082) & 0x01122408, 255)
   7-bit:   remu((x * 0x40100401) & 0x442211008, 255)
   8-bit:   remu((x * 0x202020202) & 0x10884422010, 1023)


The result of all these is a "clean" integer—right-adjusted with no unused high-order bits set. 


In all these cases the "remu" function can instead be "rem" or "mod," because its arguments are positive. The
remainder function is simply summing the digits of a base 256 or base 1024 number, much like casting out
nines. Hence it can be replaced with a multiply and a shift right. For example, the 6-bit formula has the
following alternative on a 32-bit machine (the multiplication must be modulo $2^{32}$):

$$\begin{aligned}
t &\leftarrow (x * \text{0x00082082}) \mathbin{\&} \text{0x01122408} \\
&(t * \text{0x01010101}) \overset{u}{\gg} 24
\end{aligned}$$
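
The multiply-and-shift form of the 6-bit reversal can be tried out with a small C sketch. It is illustrative only: rev6 is a hypothetical name, and uint32_t is assumed so that both products are taken modulo 2^32.

```c
#include <stdint.h>

/* Reverse the low-order six bits of x (x is assumed to be in 0..63).
   The first multiply makes four shifted copies of x, the mask picks one
   bit of x from each of six positions, and the second multiply gathers
   those bits into the top byte of the product. */
uint32_t rev6(uint32_t x) {
    uint32_t t = (uint32_t)(x * UINT32_C(0x00082082)) & UINT32_C(0x01122408);
    return (uint32_t)(t * UINT32_C(0x01010101)) >> 24;
}
```

For example, rev6(0x0B) maps binary 001011 to 110100, which is 0x34.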


These formulas are limited in their utility because they involve a remaindering operation (20 cycles or more) 
and/or some multiplications, as well as loading of large constants. The formula immediately above requires ten 
basic RISC instructions, two of which are multiply's, which amounts to about 20 cycles on a present-day RISC. 
On the other hand, an adaptation of the code of Figure 7-1 to reverse 6-bit integers requires about 15 
instructions, and probably about 9 to 15 cycles, depending on the amount of instruction-level parallelism in the 
machine. These techniques, however, do give compact code. Below are a few more techniques that might 
possibly be useful, all for a 32-bit machine. They involve a sort of double application of the idea from [HAK], 
to extend the technique to 8- and 9-bit integers on a 32-bit machine. 


The following is a formula for reversing an 8-bit integer: 


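$$\begin{aligned}
s &\leftarrow (x * \text{0x02020202}) \mathbin{\&} \text{0x84422010} \\
t &\leftarrow (x * 8) \mathbin{\&} \text{0x00000420} \\
&\mathrm{remu}(s + t, 1023)
\end{aligned}$$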


Here the "remu" cannot be changed to a multiply and shift. (You have to work these out, and look at the bit 
patterns, to see why.) 


Here is a similar formula for reversing an 8-bit integer, which is interesting because it can be simplified quite a 
bit: 


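$$\begin{aligned}
s &\leftarrow (x * \text{0x00020202}) \mathbin{\&} \text{0x01044010} \\
t &\leftarrow (x * \text{0x00080808}) \mathbin{\&} \text{0x02088020} \\
&\mathrm{remu}(s + t, 4095)
\end{aligned}$$

The simplifications are that the second product is just a shift left of the first product, the last mask can be
generated from the second with just one instruction (shift), and the remainder can be replaced by a multiply and
shift. It simplifies to 14 basic RISC instructions, two of which are multiply's:

$$\begin{aligned}
u &\leftarrow x * \text{0x00020202} \\
m &\leftarrow \text{0x01044010} \\
s &\leftarrow u \mathbin{\&} m \\
t &\leftarrow (u \ll 2) \mathbin{\&} (m \ll 1) \\
&(\text{0x01001001} * (s + t)) \overset{u}{\gg} 24
\end{aligned}$$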


The following is a formula for reversing a 9-bit integer: 


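$$\begin{aligned}
s &\leftarrow (x * \text{0x01001001}) \mathbin{\&} \text{0x84108010} \\
t &\leftarrow (x * \text{0x00040040}) \mathbin{\&} \text{0x00841080} \\
&\mathrm{remu}(s + t, 1023)
\end{aligned}$$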


The second multiplication can be avoided because the product is equal to the first product shifted right six 
positions. The last mask is equal to the second mask shifted right eight positions. With these simplifications, 
this requires 12 basic RISC instructions, including the one multiply and one remainder. The remainder 
operation must be unsigned, and it cannot be changed to a multiply and shift. 


The reader who studies these marvels will be able to devise similar code for other bit-permuting operations. As 
a simple (and artificial) example, suppose it is desired to extract every other bit from an 8-bit quantity, and 
compress the four bits to the right. That is, the desired transformation is 


0000 0000 0000 0000 0000 0000 abcd efgh ==> 
0000 0000 0000 0000 0000 0000 0000 bdfh 


This may be computed as follows: 


$$\begin{aligned}
t &\leftarrow (x * \text{0x01010101}) \mathbin{\&} \text{0x40100401} \\
&(t * \text{0x08040201}) \overset{u}{\gg} 27
\end{aligned}$$
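
As a C sketch of this two-multiply extraction (not code from the text): every_other_bit is a hypothetical name, x is assumed to lie in the range 0 to 255, and uint32_t is assumed so that the second product wraps modulo 2^32.

```c
#include <stdint.h>

/* For an 8-bit value x = abcdefgh, returns the 4-bit value bdfh. */
uint32_t every_other_bit(uint32_t x) {
    /* Four copies of x, one per byte; the mask keeps b, d, f, h at
       bit positions 30, 20, 10, and 0. */
    uint32_t t = (uint32_t)(x * UINT32_C(0x01010101)) & UINT32_C(0x40100401);
    /* The second multiply lines the four bits up in bits 30..27. */
    return (uint32_t)(t * UINT32_C(0x08040201)) >> 27;
}
```

For instance, x = 0x41 (binary 0100 0001) has b = 1 and h = 1, so the result is binary 1001, or 9.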


On most machines, the most practical way to do all these operations is by indexing into a table of 1-byte (or 9- 
bit) integers. 


Incrementing a Reversed Integer 


The Fast Fourier Transform (FFT) algorithm employs an integer i and its bit reversal rev(i) in a loop in which i 
is incremented by 1 [PB]. Straightforward coding would increment i and then compute rev(i) on each loop 
iteration. For small integers, computing rev(i) by table lookup is fast and practical. For large integers, however, 
table lookup is not practical and, as we have seen, computing rev(i) requires some 29 instructions. 


If table lookup cannot be used, it is more efficient to maintain i in both normal and bit-reversed forms, 
incrementing them both on each loop iteration. This raises the question of how best to increment an integer that 
is in a register in reversed form. To illustrate, on a 4-bit machine we wish to successively step through the 
values (in hexadecimal) 


0, 8, 4, C, 2, A, 6, E, 1, 9, 5, D, 3, B, 7, F.


In the FFT algorithm, i and its reversal are both some specific number of bits in length, almost certainly less 
than 32, and they are both right-justified in the register. However, we assume here that i is a 32-bit integer. 
After adding 1 to the reversed 32-bit integer, a shift right of the appropriate number of bits will make the result 
usable by the FFT algorithm (both i and rev(i) are used to index an array in memory). 


The straightforward way to increment a reversed integer is to scan from the left for the first 0-bit, set it to 1, and
set all bits to the left of it (if any) to 0's. One way to code this is


unsigned x, m;

m = 0x80000000;
x = x ^ m;
if ((int)x >= 0) {
   do {
      m = m >> 1;
      x = x ^ m;
   } while (x < m);
}


This executes in three basic RISC instructions if x begins with a 0-bit, and four additional instructions for each
loop iteration. Because x begins with a 0-bit half the time, with 10 (binary) one-fourth of the time, and so on,
the average number of instructions executed is approximately


  3·(1/2) + 7·(1/4) + 11·(1/8) + 15·(1/16) + ...
= 4·(1/2) + 8·(1/4) + 12·(1/8) + 16·(1/16) + ... - 1
= 4(1/2 + 2/4 + 3/8 + 4/16 + ...) - 1
= 7.


(In the second line we added and subtracted 1, with the first 1 in the form 1/2 + 1/4 + 1/8 + 1/16 + .... This 
makes the series similar to the one analyzed on page 86.) The number of instructions executed in the worst 
case, however, is quite large (131). 


If number of leading zeros is available, adding 1 to a reversed integer may be done as follows:

First execute:     s ← nlz(¬x)
and then either:   x ← x ⊕ (0x80000000 >> s)          (signed shift right)
or:                x ← ((x << s) + 0x80000000) >> s    (unsigned shift right)


Either method requires five full RISC instructions and, to properly wrap around from 0xFFFFFFFF to 0,
requires that the shifts be modulo 64. (These formulas fail in this respect on the Intel x86 machines, because the
shifts are modulo 32.)
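The second nlz-based formula can be sketched in portable C as follows. This is an illustration, not the book's code: nlz32 is a hypothetical helper written as a simple loop, and the shifts are carried out in a 64-bit type so that a shift amount of 32 is well defined in C, which stands in for the "modulo 64" shift behavior the formula needs.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; returns 32 for x == 0.
   (Illustrative helper; a machine instruction would normally be used.) */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Add 1 to a 32-bit integer that is held in bit-reversed form. */
uint32_t inc_reversed(uint32_t x) {
    int s = nlz32(~x);                               /* count of leading 1-bits in x */
    uint64_t v = ((uint64_t)x << s) + 0x80000000u;   /* drop the leading 1's, set the first 0 */
    return (uint32_t)((v & 0xFFFFFFFFu) >> s);       /* keep 32 bits, shift back with 0-fill */
}
```

Starting from 0 and looking only at the top four bits of each successive result reproduces the sequence 0, 8, 4, C, 2, A, 6, E, ... given earlier, and inc_reversed(0xFFFFFFFF) wraps around to 0.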


7-2 Shuffling Bits 


Another important permutation of the bits of a word is the "perfect shuffle" operation, which has applications in 
cryptography. There are two varieties, called the "outer" and "inner" perfect shuffles. They both interleave the 
bits in the two halves of a word in a manner similar to a perfect shuffle of a deck of 32 cards, but they differ in 
which card is allowed to fall first. In the outer perfect shuffle, the outer (end) bits remain in the outer positions, 
and in the inner perfect shuffle, bit 15 moves to the left end of the word (position 31). If the 32-bit word is 
(where each letter denotes a single bit) 


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP, 


then after the outer perfect shuffle it is 


aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP,


and after the inner perfect shuffle it is 


AaBb CcDd EeFf GgHh IiJj KkLl MmNn OoPp.


Assume the word size W is a power of 2. Then the outer perfect shuffle operation can be accomplished with
basic RISC instructions in log2(W/2) steps, where each step swaps the second and third quartiles of
successively smaller pieces [GLS1]. That is, a 32-bit word is transformed as follows:


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP 
abcd efgh ABCD EFGH ijkl mnop IJKL MNOP 
abcd ABCD efgh EFGH ijkl IJKL mnop MNOP 
abAB cdCD efEF ghGH ijIJ klKL mnMN opOP
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP


Straightforward code for this is 


x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;


which requires 42 basic RISC instructions. This can be reduced to 30 instructions, although at an increase from 
17 to 21 cycles on a machine with unlimited instruction-level parallelism, by using the exclusive or method of 
exchanging two fields of a register (described on page 40). All quantities are unsigned: 


t = (x ^ (x >> 8)) & 0x0000FF00;  x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0;  x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C;  x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222;  x = x ^ t ^ (t << 1);


The inverse operation, the outer unshuffle, is easily accomplished by performing the swaps in reverse order: 


t = (x ^ (x >> 1)) & 0x22222222;  x = x ^ t ^ (t << 1);
t = (x ^ (x >> 2)) & 0x0C0C0C0C;  x = x ^ t ^ (t << 2);
t = (x ^ (x >> 4)) & 0x00F000F0;  x = x ^ t ^ (t << 4);
t = (x ^ (x >> 8)) & 0x0000FF00;  x = x ^ t ^ (t << 8);


Using only the last two steps of either of the above two shuffle sequences shuffles the bits of each byte 
separately. Using only the last three steps shuffles the bits of each halfword separately, and so on. Similar 
remarks apply to unshuffling, except by using the first two or three steps. 


To get the inner perfect shuffle, prepend to these sequences a step to swap the left and right halves of the 
register: 


x = (x >> 16) | (x << 16); 


(or use a rotate of 16 bit positions). The unshuffle sequence can be similarly modified by appending this line of 
code. 
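The shuffle and unshuffle sequences above can be wrapped into functions and checked against each other. The following C sketch assumes 32-bit words; the function names are illustrative only.

```c
#include <stdint.h>

/* Outer perfect shuffle: abcd...p ABCD...P  ->  aAbB cCdD ... oOpP. */
uint32_t outer_shuffle(uint32_t x) {
    uint32_t t;
    t = (x ^ (x >> 8)) & 0x0000FF00u;  x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4)) & 0x00F000F0u;  x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu;  x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1)) & 0x22222222u;  x = x ^ t ^ (t << 1);
    return x;
}

/* Outer unshuffle: the same swaps in reverse order. */
uint32_t outer_unshuffle(uint32_t x) {
    uint32_t t;
    t = (x ^ (x >> 1)) & 0x22222222u;  x = x ^ t ^ (t << 1);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu;  x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 4)) & 0x00F000F0u;  x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 8)) & 0x0000FF00u;  x = x ^ t ^ (t << 8);
    return x;
}

/* Inner shuffle: swap the halves first, then do the outer shuffle. */
uint32_t inner_shuffle(uint32_t x) {
    return outer_shuffle((x >> 16) | (x << 16));
}

/* Inner unshuffle: outer unshuffle, then swap the halves. */
uint32_t inner_unshuffle(uint32_t x) {
    x = outer_unshuffle(x);
    return (x >> 16) | (x << 16);
}
```

As a quick sanity check, the word 0x0000FFFF (left half all 0's, right half all 1's) becomes 0x55555555 under the outer shuffle and 0xAAAAAAAA under the inner shuffle.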


Altering the transformation to swap the first and fourth quartiles of successively smaller pieces produces the bit 
reversal of the inner perfect shuffle. 


Perhaps worth mentioning is the special case in which the left half of the word x is all 0. In other words, we
want to move the bits in the right half of x to every other bit position; that is, to transform the 32-bit word


0000 0000 0000 0000 ABCD EFGH IJKL MNOP 


to 


0A0B 0C0D 0E0F 0G0H 0I0J 0K0L 0M0N 0O0P.


The outer perfect shuffle code can be simplified to do this task in 22 basic RISC instructions. The code below, 
however, does it in only 19 basic RISC instructions, at no cost in execution time on a machine with unlimited 
instruction-level parallelism (12 cycles with either method). This code does not require that the left half of word
x be initially cleared.


x = ((x & 0xFF00) << 8) | (x & 0x00FF);
x = ((x << 4) | x) & 0x0F0F0F0F;
x = ((x << 2) | x) & 0x33333333;
x = ((x << 1) | x) & 0x55555555;


Similarly, for the inverse of this "half shuffle" operation (a special case of compress; see page 116), the outer 
perfect unshuffle code can be simplified to do the task in 26 or 29 basic RISC instructions, depending on 
whether or not an initial and operation is required to clear the bits in the odd positions. The code below, 
however, does it in only 18 or 21 basic RISC instructions, and with less execution time on a machine with 
unlimited instruction-level parallelism (12 or 15 cycles). 


x = x & 0x55555555;                // (If required.)
x = ((x >> 1) | x) & 0x33333333;
x = ((x >> 2) | x) & 0x0F0F0F0F;
x = ((x >> 4) | x) & 0x00FF00FF;
x = ((x >> 8) | x) & 0x0000FFFF;


7-3 Transposing a Bit Matrix 


The transpose of a matrix A is a matrix whose columns are the rows of A and whose rows are the columns of A. 
Here we consider the problem of computing the transpose of a bit matrix whose elements are single bits that are 
packed eight per byte, with rows and columns beginning on byte boundaries. This seemingly simple 
transformation is surprisingly costly in instructions executed. 


On most machines it would be very slow to load and store individual bits, mainly due to the code that would be 
required to extract and (worse yet) to store individual bits. A better method is to partition the matrix into 8x8 
submatrices, load each 8x8 submatrix into registers, compute the transpose of the submatrix in registers, and 
then store the 8x8 result in the appropriate place in the target matrix. This section first discusses the problem of 
computing the transpose of the 8x8 submatrix. 


It doesn't matter whether the matrix is stored in row-major or column-major order; computing the transpose 
consists of the same operations in either event. Assuming for discussion that it's in row-major order, an 8x8 
submatrix is loaded into eight registers with eight load byte instructions, addressing a column of the source 
matrix. That is, the addresses referenced by the load byte instructions are separated by multiples of the source 
matrix width in bytes. After the transpose of the 8x8 submatrix is computed, it is stored in a column of the 
target matrix—that is, it is stored with eight store byte instructions into locations separated by multiples of the 
width of the target matrix in bytes (which is different from the width of the source matrix if the matrices are not 
square). Thus, we are given eight 8-bit quantities right-justified in registers a0, a1, ..., a7, and we wish to
compute eight 8-bit quantities right-justified in registers b0, b1, ..., b7, for use in the store byte instructions.


This is illustrated below, where each digit and letter represents a single bit. Notice that we consider the main 
diagonal to run from bit 7 of byte 0 to bit 0 of byte 7. Some readers with a little-endian background may be 
accustomed to thinking of the main diagonal as running from bit 0 of byte 0 to bit 7 of byte 7. 


a0 = 0123 4567           b0 = 08go wEMU
a1 = 89ab cdef           b1 = 19hp xFNV
a2 = ghij klmn           b2 = 2aiq yGOW
a3 = opqr stuv    ==>    b3 = 3bjr zHPX
a4 = wxyz ABCD           b4 = 4cks AIQY
a5 = EFGH IJKL           b5 = 5dlt BJRZ
a6 = MNOP QRST           b6 = 6emu CKS$
a7 = UVWX YZ$.           b7 = 7fnv DLT.


The straightforward code for this problem is to select and place each result bit individually, as follows. The 
multiplications and divisions represent left and right shifts, respectively. 


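b0 = (a0 & 128)    | (a1 & 128)/2  | (a2 & 128)/4  | (a3 & 128)/8  |
     (a4 & 128)/16 | (a5 & 128)/32 | (a6 & 128)/64 | (a7      )/128;
b1 = (a0 &  64)*2  | (a1 &  64)    | (a2 &  64)/2  | (a3 &  64)/4  |
     (a4 &  64)/8  | (a5 &  64)/16 | (a6 &  64)/32 | (a7 &  64)/64;
b2 = (a0 &  32)*4  | (a1 &  32)*2  | (a2 &  32)    | (a3 &  32)/2  |
     (a4 &  32)/4  | (a5 &  32)/8  | (a6 &  32)/16 | (a7 &  32)/32;
b3 = (a0 &  16)*8  | (a1 &  16)*4  | (a2 &  16)*2  | (a3 &  16)    |
     (a4 &  16)/2  | (a5 &  16)/4  | (a6 &  16)/8  | (a7 &  16)/16;
b4 = (a0 &   8)*16 | (a1 &   8)*8  | (a2 &   8)*4  | (a3 &   8)*2  |
     (a4 &   8)    | (a5 &   8)/2  | (a6 &   8)/4  | (a7 &   8)/8;
b5 = (a0 &   4)*32 | (a1 &   4)*16 | (a2 &   4)*8  | (a3 &   4)*4  |
     (a4 &   4)*2  | (a5 &   4)    | (a6 &   4)/2  | (a7 &   4)/4;
b6 = (a0 &   2)*64 | (a1 &   2)*32 | (a2 &   2)*16 | (a3 &   2)*8  |
     (a4 &   2)*4  | (a5 &   2)*2  | (a6 &   2)    | (a7 &   2)/2;
b7 = (a0      )*128| (a1 &   1)*64 | (a2 &   1)*32 | (a3 &   1)*16 |
     (a4 &   1)*8  | (a5 &   1)*4  | (a6 &   1)*2  | (a7 &   1);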


This executes in 174 instructions on most machines (62 and's, 56 shift's, and 56 or's). The or's can of course be 
add's. On PowerPC it can be done, perhaps surprisingly, in 63 instructions (seven move register's and 56 rotate 
left word immediate then mask insert's). We are not counting the load byte and store byte instructions, nor their 
addressing code. 


Although there does not seem to be a really great algorithm for this problem, the method to be described beats 
the straightforward method by more than a factor of 2 on a basic RISC machine. 


First, treat the 8x8-bit matrix as 16 2x2-bit matrices, and transpose each of the 16 2x2-bit matrices. Second, 
treat the matrix as four 2x2 submatrices whose elements are 2x2-bit matrices and transpose each of the four 2x2 
submatrices. Finally, treat the matrix as a 2x2 matrix whose elements are 4x4-bit matrices, and transpose the 
2x2 matrix. These transformations are illustrated below. 


0123 4567       082a 4c6e       08go 4cks       08go wEMU
89ab cdef       193b 5d7f       19hp 5dlt       19hp xFNV
ghij klmn       goiq ksmu       2aiq 6emu       2aiq yGOW
opqr stuv  ==>  hpjr ltnv  ==>  3bjr 7fnv  ==>  3bjr zHPX
wxyz ABCD       wEyG AICK       wEMU AIQY       4cks AIQY
EFGH IJKL       xFzH BJDL       xFNV BJRZ       5dlt BJRZ
MNOP QRST       MUOW QYS$       yGOW CKS$       6emu CKS$
UVWX YZ$.       NVPX RZT.       zHPX DLT.       7fnv DLT.


Rather than carrying out these steps on the eight individual bytes in eight registers, a net improvement results
from first packing the bytes four to a register, performing the bit-swaps on the two registers, and then
unpacking. A complete procedure is shown in Figure 7-2. Parameter A is the address of the first byte of an 8x8
submatrix of the source matrix, which is of size 8m x 8n bits. Similarly, parameter B is the address of the first
byte of an 8x8 submatrix in the target matrix, which is of size 8n x 8m bits. That is, the full source matrix is
8m x n bytes, and the full target matrix is 8n x m bytes.


Figure 7-2 Transposing an 8x8-bit matrix.
void transpose8(unsigned char A[8], int m, int n,
                unsigned char B[8]) {
   unsigned x, y, t;

   // Load the array and pack it into x and y.

   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m];
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m];

   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7);
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7);

   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14);
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14);

   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F);
   x = t;

   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x;
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y;
}


The line


t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7);


is quite cryptic, for sure. It is swapping bits 1 and 8 (counting from the right), 3 and 10, 5 and 12, and so on, in
word x, while not moving bits 0, 2, 4, and so on. The swaps are done with the exclusive or method of bit
swapping, described on page 40. Word x, before and after the first round of swaps, is


0123 4567 89ab cdef ghij klmn opqr stuv
082a 4c6e 193b 5d7f goiq ksmu hpjr ltnv


To get a realistic comparison of these methods, the naive method described on page 109 was filled out into a
complete program similar to that of Figure 7-2. Both were compiled with the GNU C compiler to a target
machine that is very similar to the basic RISC. The resulting number of instructions, counting all load's,
store's, addressing code, prologs, and epilogs, is 219 for the naive code and 101 for Figure 7-2. (The prologs
and epilogs were null, except for a return branch instruction.) A version of the code of Figure 7-2 adapted to a
64-bit basic RISC (in which x and y would be held in the same register) would be about 85 instructions.


The algorithm of Figure 7-2 runs from fine to coarse granularity, based on the lengths of the groups of bits that
are swapped. The method can also be run from coarse to fine granularity. To do this, first treat the 8x8-bit
matrix as a 2x2 matrix whose elements are 4x4-bit matrices, and transpose the 2x2 matrix. Then treat each of the
four 4x4 submatrices as a 2x2 matrix whose elements are 2x2-bit matrices, and transpose each of the four 2x2
submatrices, and so on. The code for this is the same as that of Figure 7-2 except with the three groups of
statements that do the bit-rearranging run in reverse order.
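The coarse-to-fine ordering can be sketched as follows. To keep the example short it assumes the 8x8 matrix is stored in eight consecutive bytes (that is, m = n = 1), and it adds explicit casts when packing the bytes; both are choices made for this illustration. The function names and the bit-by-bit reference routine are hypothetical and serve only as a cross-check.

```c
#include <stdint.h>

/* Transpose an 8x8 bit matrix stored one row per byte, applying the
   three groups of swaps from coarse (4x4 blocks) to fine (single bits). */
void transpose8_coarse_first(const unsigned char A[8], unsigned char B[8]) {
    uint32_t x, y, t;

    /* Pack rows 0..3 into x and rows 4..7 into y. */
    x = ((uint32_t)A[0] << 24) | ((uint32_t)A[1] << 16) |
        ((uint32_t)A[2] << 8)  |  (uint32_t)A[3];
    y = ((uint32_t)A[4] << 24) | ((uint32_t)A[5] << 16) |
        ((uint32_t)A[6] << 8)  |  (uint32_t)A[7];

    /* Swap the two off-diagonal 4x4 blocks. */
    t = (x & 0xF0F0F0F0u) | ((y >> 4) & 0x0F0F0F0Fu);
    y = ((x << 4) & 0xF0F0F0F0u) | (y & 0x0F0F0F0Fu);
    x = t;

    /* Swap the off-diagonal 2x2 blocks inside each 4x4 block. */
    t = (x ^ (x >> 14)) & 0x0000CCCCu;  x = x ^ t ^ (t << 14);
    t = (y ^ (y >> 14)) & 0x0000CCCCu;  y = y ^ t ^ (t << 14);

    /* Swap the off-diagonal bits inside each 2x2 block. */
    t = (x ^ (x >> 7)) & 0x00AA00AAu;   x = x ^ t ^ (t << 7);
    t = (y ^ (y >> 7)) & 0x00AA00AAu;   y = y ^ t ^ (t << 7);

    /* Unpack. */
    B[0] = (unsigned char)(x >> 24);  B[1] = (unsigned char)(x >> 16);
    B[2] = (unsigned char)(x >> 8);   B[3] = (unsigned char)x;
    B[4] = (unsigned char)(y >> 24);  B[5] = (unsigned char)(y >> 16);
    B[6] = (unsigned char)(y >> 8);   B[7] = (unsigned char)y;
}

/* Bit-by-bit reference: column j of row i becomes column i of row j,
   with column 0 taken to be bit 7 of each byte. */
void transpose8_ref(const unsigned char A[8], unsigned char B[8]) {
    for (int i = 0; i < 8; i++) {
        unsigned b = 0;
        for (int j = 0; j < 8; j++)
            b |= ((A[j] >> (7 - i)) & 1u) << (7 - j);
        B[i] = (unsigned char)b;
    }
}
```

A matrix whose first row is all 1's and whose other rows are 0 should come out with its first column all 1's, so every output byte is 0x80.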
Transposing a 32x32-Bit Matrix 


The same recursive technique that was used for the 8x8-bit matrix can of course be used for larger matrices. For 
a 32x32-bit matrix it takes five stages. 


The details are quite different from Figure 7-2 because here we assume that the entire 32x32-bit matrix does not 
fit in the general register space, and we seek a compact procedure that indexes the appropriate words of the bit 
matrix to do the bit swaps. The algorithm to be described works best if run from coarse to fine granularity. 

In the first stage, treat the matrix as four 16x16-bit matrices, and transform it as follows: 


| A B |         | A C |
| C D |   ==>   | B D |


A denotes the left half of the first 16 words of the matrix, B denotes the right half of the first 16 words, and so 
on. It should be clear that the above transformation may be accomplished by the following swaps: 


Right half of word 0 with the left half of word 16,

Right half of word 1 with the left half of word 17,

...

Right half of word 15 with the left half of word 31.


To implement this in code, we will have an index k that ranges from 0 to 15. In a loop controlled by k, the right 
half of word k will be swapped with the left half of word k + 16. 


In the second stage, treat the matrix as 16 8x8-bit matrices, and transform it as follows: 


A B C D         A E C G
E F G H   ==>   B F D H
I J K L         I M K O
M N O P         J N L P


This transformation may be accomplished by the following swaps: 
Bits 0x00FF00FF of word 0 with bits 0xFF00FF00 of word 8,
Bits 0x00FF00FF of word 1 with bits 0xFF00FF00 of word 9, and so on.


This means that bits 0-7 (the least significant eight bits) of word 0 are swapped with bits 8-15 of word 8, and so 
on. The indexes of the first word in these swaps are k = 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23. A 
way to step k through these values is 


k = (k + 9) & ~8.


In the loop controlled by k, bits of word k are swapped with bits of word k +8. 
Similarly, the third stage does the following swaps: 

Bits 0x0F0F0F0F of word 0 with bits 0xF0F0F0F0 of word 4,

Bits 0x0F0F0F0F of word 1 with bits 0xF0F0F0F0 of word 5, and so on.


The indexes of the first word in these swaps are k = 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27. A way 
to step k through these values is 


k = (k + 5) & ~4.


In the loop controlled by k, bits of word k are swapped with bits of word k + 4. 


These considerations are coded rather compactly in the C function shown in Figure 7-3 [GLS1]. The outer loop
controls the five stages, with j taking on the values 16, 8, 4, 2, and 1. It also steps the mask m through the
values 0x0000FFFF, 0x00FF00FF, 0x0F0F0F0F, 0x33333333, and 0x55555555. (The code for this, m = m ^
(m << j), is a nice little trick. It does not have an inverse, which is the main reason this code works best for
coarse to fine transformations.) The inner loop steps k through the values described above. The inner loop body
swaps the bits of A[k] identified by mask m with the bits of A[k+j] shifted right j and identified by m,
which is equivalent to the bits of A[k+j] identified with the complement of m. The code for performing
these swaps is an adaptation of the "three exclusive or" technique shown on page 39 column (c).
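The two loop-control tricks, stepping the mask with m = m ^ (m << j) and stepping the index with k = (k + j + 1) & ~j, can be observed in isolation. The following C sketch simply records the values each loop visits; the helper names stage_masks and inner_indexes are made up for this illustration.

```c
#include <stdint.h>

/* Record the mask used in each of the five stages (j = 16, 8, 4, 2, 1).
   Returns the number of masks stored in out[]. */
int stage_masks(uint32_t out[5]) {
    uint32_t m = 0x0000FFFFu;
    int n = 0;
    for (int j = 16; j != 0; j = j >> 1, m = m ^ (m << j))
        out[n++] = m;
    return n;
}

/* Record the values of k visited by the inner loop for a given stage j
   (j must be 16, 8, 4, 2, or 1). Returns the count, which is always 16. */
int inner_indexes(int j, int out[16]) {
    int n = 0;
    for (int k = 0; k < 32; k = (k + j + 1) & ~j)
        out[n++] = k;
    return n;
}
```

For j = 8 the recorded indexes are 0 through 7 followed by 16 through 23, and for j = 4 they are 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27, matching the lists in the text. The index update works because adding j + 1 carries past bit j whenever it is about to be set, and the & ~j clears that bit otherwise.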


Figure 7-3 Compact code for transposing a 32x32-bit matrix. 


void transpose32(unsigned A[32]) {
   int j, k;
   unsigned m, t;

   m = 0x0000FFFF;
   for (j = 16; j != 0; j = j >> 1, m = m ^ (m << j)) {
      for (k = 0; k < 32; k = (k + j + 1) & ~j) {
         t = (A[k] ^ (A[k+j] >> j)) & m;
         A[k] = A[k] ^ t;
         A[k+j] = A[k+j] ^ (t << j);
      }
   }
}

Based on compiling this function with the GNU C compiler to a machine very similar to the basic RISC, this
compiles into 31 instructions, with 20 in the inner loop and 7 in the outer loop but not in the inner loop. Thus, it
executes in 4 + 5(7 + 16 · 20) = 1639 instructions. In contrast, if this function were performed using 16 calls on
the 8x8 transpose program of Figure 7-2, then it would take 16(101 + 5) = 1696 instructions, assuming the 16
calls are "strung out." This includes five instructions for each function call (observed in compiled code). Thus,
the two methods are, on the surface anyway, very nearly equal in execution time.


On the other hand, for a 64-bit machine the code of Figure 7-3 can easily be modified to transpose a 64x64-bit
matrix, and it would take about 4 + 6(7 + 32 · 20) = 3886 instructions. Doing the job with 64 executions of the
8x8 transpose method would take about 64(85 + 5) = 5760 instructions.


The algorithm works in place, and thus if it is used to transpose a larger matrix, additional steps are required to 
move 32x32-bit submatrices. It can be made to put the result matrix in an area distinct from the source matrix 
by separating out either the first or last execution of the "for j-loop" and having it store the result in the other 
area. 


About half the instructions executed by the function of Figure 7-3 are for loop control, and the function loads
and stores the entire matrix five times. Would it be reasonable to reduce this overhead by unrolling the loops? It
would, if you are looking for the ultimate in speed, if memory space is not a problem, if your machine's
I-fetching can keep up with a large block of straight-line code, and especially if the branches or loads are costly
in execution time. The bulk of the program will be the six instructions that do the bit swaps repeated 80 times
(5 · 16). In addition, the program will need 32 load instructions to load the source matrix and 32 store
instructions to store the result, for a total of at least 544 instructions.


Our GNU C compiler will not unroll loops by such large factors (16 for the inner loop, five for the outer loop). 
Figure 7-4 outlines a program in which the unrolling is done by hand. This program is shown as not working in 
place, but it executes correctly in place, if that is desired, by invoking it with identical arguments. The number 
of "swap" lines is 80. Our GNU C compiler for the basic RISC machine compiles this into 576 instructions 
(branch-free, except for the function return), counting prologs and epilogs. This machine does not have the 
store multiple and load multiple instructions, but it can save and restore registers two at a time with store 
double and load double instructions. 


Figure 7-4 Straight-line code for transposing a 32x32-bit matrix. 


#define swap(a0, a1, j, m) t = (a0 ^ (a1 >> j)) & m; \
                           a0 = a0 ^ t; \
                           a1 = a1 ^ (t << j);


void transpose32(unsigned A[32], unsigned B[32]) {
   unsigned m, t;
   unsigned a0, a1, a2, a3, a4, a5, a6, a7,
            a8, a9, a10, a11, a12, a13, a14, a15,
            a16, a17, a18, a19, a20, a21, a22, a23,
            a24, a25, a26, a27, a28, a29, a30, a31;

   a0  = A[ 0];  a1  = A[ 1];  a2  = A[ 2];  a3  = A[ 3];
   a4  = A[ 4];  a5  = A[ 5];  a6  = A[ 6];  a7  = A[ 7];
   ...
   a28 = A[28];  a29 = A[29];  a30 = A[30];  a31 = A[31];

   m = 0x0000FFFF;
   swap(a0, a16, 16, m)
   swap(a1, a17, 16, m)
   ...
   swap(a15, a31, 16, m)
   m = 0x00FF00FF;
   swap(a0, a8, 8, m)
   swap(a1, a9, 8, m)
   ...
   swap(a28, a29, 1, m)
   swap(a30, a31, 1, m)

   B[ 0] = a0;   B[ 1] = a1;   B[ 2] = a2;   B[ 3] = a3;
   B[ 4] = a4;   B[ 5] = a5;   B[ 6] = a6;   B[ 7] = a7;
   ...
   B[28] = a28;  B[29] = a29;  B[30] = a30;  B[31] = a31;
}


There is a way to squeeze a little more performance out of this if your machine has a rotate shift instruction
(either left or right). The idea is to replace all the swap operations of Figure 7-4, which take six instructions
each, with simpler swaps that do not involve a shift, which take four instructions each (use the swap macro
given, with the shifts omitted).


First, rotate right words A[16..31] (that is, A[k] for 16 <= k <= 31) by 16 bit positions. Second, swap the right
halves of A[0] with A[16], A[1] with A[17], and so on, similarly to the code of Figure 7-4. Third, rotate right
words A[8..15] and A[24..31] by eight bit positions, and then swap the bits indicated by a mask of 0x00FF00FF
in words A[0] and A[8], A[1] and A[9], and so on, as in the code of Figure 7-4. After five stages of this, you
don't quite have the transpose. Finally, you have to rotate left word A[1] by one bit position, A[2] by two bit
positions, and so on (31 instructions). We do not show the code, but the steps are illustrated below for a 4x4-bit
matrix.


abcd       abcd       abij       abij       aeim       aeim
efgh  ==>  efgh  ==>  efmn  ==>  nefm  ==>  nbfj  ==>  bfjn
ijkl       klij       klcd       klcd       kocg       cgko
mnop       opmn       opgh       hopg       hlpd       dhlp


The bit-rearranging part of the program of Figure 7-4 requires 480 instructions (80 swaps at six instructions 
each). The revised program, using rotate instructions, requires 80 swaps at four instructions each, plus 80 rotate 
instructions (16 · 5) for the first five stages, plus a final 31 rotate instructions, for a total of 431 instructions.
The prolog and epilog code would be unchanged, so using rotate instructions in this way saves 49 instructions. 


There is another quite different method of transposing a bit matrix: apply three shearing transformations
[GLS1]. If the matrix is nxn, the steps are (1) rotate row i to the right i bit positions, (2) rotate column j
upwards (j + 1) mod n bit positions, (3) rotate row i to the right (i + 1) mod n bit positions, and (4) reflect the
matrix about a horizontal axis through the midpoint. To illustrate, for a 4x4-bit matrix:


abcd       abcd       hlpd       dhlp       aeim
efgh  ==>  hefg  ==>  kocg  ==>  cgko  ==>  bfjn
ijkl       klij       nbfj       bfjn       cgko
mnop       nopm       aeim       aeim       dhlp


This method is not quite competitive with the others because step (2) is costly. (To do it at reasonable cost, 
rotate upwards all columns that rotate by n/2 or more bit positions by n/2 bit positions [these are columns n/2 - 
1 through n - 2], then rotate certain columns upwards n/4 bit positions, and so on.) Steps 1 and 3 require only n 
- 1 instructions each, and step 4 requires no instructions at all if the results are simply stored to the appropriate 
locations. 


If an 8x8-bit matrix is stored in a 64-bit word in the obvious way (top row in the most significant eight bits, and 
so on), then the matrix transpose operation is equivalent to three outer perfect shuffles or unshuffles [GLS1]. 
This is a very good way to do it if your machine has shuffle or unshuffle as a single instruction, but it is not a 
good method on a basic RISC machine. 


7-4 Compress, or Generalized Extract


The APL language includes an operation called compress, written B/V, where B is a Boolean vector and V is a
vector of the same length as B, with arbitrary elements. The result of the operation is a vector consisting of the
elements of V for which the corresponding bit in B is 1. The length of the result vector is equal to the number 
of 1's in B. 

Here we consider a similar operation on the bits of a word. Given a mask m and a word x, the bits of x for 
which the corresponding mask bit is 1 are selected and moved ("compressed") to the right. For example, if the 
word to be compressed is (where each letter denotes a single bit) 


abcd efgh ijkl mnop qrst uvwx yzAB CDEF,


and the mask is 


0000 1111 0011 0011 1010 1010 0101 0101, 


then the result is 


0000 0000 0000 0000 efgh klop qsuw zBDF. 


This operation might also be called generalized extract, by analogy with the extract instruction found on many 
computers. 
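To make the definition concrete, here is a deliberately plain C sketch that applies it one bit at a time. It is meant only as an executable statement of what compress computes, not as an efficient method; the name compress_bits is hypothetical.

```c
#include <stdint.h>

/* Select the bits of x where the corresponding bit of m is 1 and pack
   them, in their original order, into the low-order end of the result. */
uint32_t compress_bits(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int s = 0;                                 /* next free position in r */
    for (int i = 0; i < 32; i++) {
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << s;
            s++;
        }
    }
    return r;
}
```

For example, compress_bits(0x12345678, 0x0000FF00) gives 0x56 (an ordinary field extract), and compress_bits(0x12345678, 0xF0F0F0F0) gives 0x1357. With the mask 0x0F33AA55 from the example above, which has sixteen 1-bits, an all-1's word compresses to 0x0000FFFF.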


We are interested in code for this operation with minimum worst-case execution time, and offer the simple loop 
of Figure 7-5 as a straw man to be improved upon. This code has no branches in the loop, and it executes in 260 
instructions worst case, including the subroutine prolog and epilog. 


Figure 7-5 A simple loop for the compress operation. 


unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit.

   r = 0;
   s = 0;
   do {
      b = m & 1;
      r = r | ((x & b) << s);
      s = s + b;
      x = x >> 1;
      m = m >> 1;
   } while (m != 0);
   return r;
}

It is possible to improve on this by repeatedly using the parallel prefix method (see page 75) with the exclusive 
or operation [GLS1]. We will denote the parallel prefix operation by PP-XOR. The basic idea is to first identify 
the bits of argument x that are to be moved right an odd number of bit positions, and move those. (This 
operation is simplified if x is first anded with the mask, to clear out irrelevant bits.) Mask bits are moved in the 
same way. Next, we identify the bits of x that are to be moved an odd multiple of 2 positions (2, 6, 10, and so 
on), and we then move these bits of x and the mask. Next, we identify and move the bits that are to be moved 
an odd multiple of 4 positions, then those that move an odd multiple of 8, and then those that move 16 bit 
positions. 


Because this algorithm, believed to be original with [GLS1], is a bit difficult to understand, and because it is
perhaps surprising that something along these lines can be done at all, we will describe its operation in some
detail. Suppose the inputs are


x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF,
m = 1000 1000 1110 0000 0000 1111 0101 0101,
    1    1    111
    9    6    333            4444  3 2  1 0


where each letter in x represents a single bit (with value 0 or 1). The numbers below each 1-bit in the mask m
denote how far the corresponding bit of x must move to the right. This is the number of 0's in m to the right of
the bit. As mentioned above, it is convenient to first clear out the irrelevant bits of x, giving


x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.


The plan is to first determine which bits move an odd number of positions (to the right), and move those one bit 
position. Recall that the PP-XOR operation results in a 1-bit at each position where the number of 1's at and to 
the right of that position is odd. We wish to identify those bits for which the number of 0's strictly to the right is 
odd. This can be done by computing mk = ~m << 1 and performing PP-XOR on the result. This gives 


mk = 1110 1110 0011 1111 1110 0001 0101 0100,
mp = 1010 0101 1110 1010 1010 0000 1100 1100.


Observe that mk identifies the bits of m that have a 0 immediately to the right, and mp sums these, modulo 2, 
from the right. Thus, mp identifies the bits of m that have an odd number of 0's to the right. 
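The PP-XOR step and the derivation of mk and mp can be reproduced numerically. The sketch below assumes 32-bit words; pp_xor is an illustrative name for the parallel prefix operation, and the constants are the example mask m written in hexadecimal.

```c
#include <stdint.h>

/* Parallel prefix with exclusive or: bit i of the result is the XOR of
   bits 0..i of v, that is, 1 when the number of 1-bits at and to the
   right of position i is odd. */
uint32_t pp_xor(uint32_t v) {
    v ^= v << 1;
    v ^= v << 2;
    v ^= v << 4;
    v ^= v << 8;
    v ^= v << 16;
    return v;
}

/* mk for the first iteration: 1 where m has a 0 immediately to the right. */
uint32_t first_mk(uint32_t m) {
    return ~m << 1;
}
```

With m = 0x88E00F55 (the mask 1000 1000 1110 0000 0000 1111 0101 0101), first_mk(m) is 0xEE3FE154 and pp_xor of that value is 0xA5EAA0CC, which are the mk and mp bit patterns shown above.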


The bits that will be moved one position are those that are in positions that have an odd number of 0's strictly to 
the right (identified by mp) and that have a 1-bit in the original mask. This is simply mv = mp & m: 


mv = 1000 0000 1110 0000 0000 0000 0100 0100.


These bits of m may be moved with the assignment


m = (m ^ mv) | (mv >> 1);


and the same bits of x may be moved with the two assignments


t = x & mv;
x = (x ^ t) | (t >> 1);


(Moving the bits of m is simpler because all the selected bits are 1's.) Here the exclusive or is turning off bits
known to be 1 in m and x, and the or is turning on bits known to be 0 in m and x. The operations could also,
alternatively, both be exclusive or, or subtract and add, respectively. The results, after moving the bits selected
by mv right one position, are:


m = 0100 1000 0111 0000 0000 1111 0011 0011,
x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.


Now we must prepare a mask for the second iteration, in which we identify bits that are to move an odd
multiple of 2 positions to the right. Notice that the quantity mk & ~mp identifies those bits that have a 0
immediately to the right in the original mask m, and those bits that have an even number of 0's to the right in the
original mask. These properties apply jointly, although not individually, to the revised mask m. (That is to say,
mk identifies all the positions in the revised mask m that have a 0 to the immediate right and an even number of
0's to the right.) This is the quantity that, if summed from the right with PP-XOR, identifies those bits that
move to the right an odd multiple of 2 positions (2, 6, 10, and so on). Therefore, the procedure is to assign this
quantity to mk and perform a second iteration of the above steps. The revised value of mk is


mk = 0100 1010 0001 0101 0100 0001 0001 0000. 


A complete C function for this operation is shown in Figure 7-6. It does the job in 127 basic RISC instructions
(constant), including the subroutine prolog and epilog. Figure 7-7 shows the sequence of values taken on by
certain variables at key points in the computation, with the same inputs that were used in the discussion above.
Observe that a by-product of the algorithm, in the last value assigned to m, is the original m with all its 1-bits
compressed to the right.


Figure 7-6 Parallel prefix method for the compress operation.


unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;           // Clear irrelevant bits.
   mk = ~m << 1;        // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel prefix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                     // Bits to move.
      m = m ^ mv | (mv >> (1 << i));   // Compress m.
      t = x & mv;
      x = x ^ t | (t >> (1 << i));     // Compress x.
      mk = mk & ~mp;
   }
   return x;
}


Figure 7-7 Operation of the parallel prefix method for the compress operation.


               x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
               m = 1000 1000 1110 0000 0000 1111 0101 0101
               x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F

i = 0,        mk = 1110 1110 0011 1111 1110 0001 0101 0100
After PP,     mp = 1010 0101 1110 1010 1010 0000 1100 1100
              mv = 1000 0000 1110 0000 0000 0000 0100 0100
               m = 0100 1000 0111 0000 0000 1111 0011 0011
               x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF

i = 1,        mk = 0100 1010 0001 0101 0100 0001 0001 0000
After PP,     mp = 1100 0110 0000 1100 1100 0000 1111 0000
              mv = 0100 0000 0000 0000 0000 0000 0011 0000
               m = 0001 1000 0111 0000 0000 1111 0000 1111
               x = 000a e000 0ijk 0000 0000 uvwx 0000 zBDF

i = 2,        mk = 0000 1000 0001 0001 0000 0001 0000 0000
After PP,     mp = 0000 0111 1111 0000 1111 1111 0000 0000
              mv = 0000 0000 0111 0000 0000 1111 0000 0000
               m = 0001 1000 0000 0111 0000 0000 1111 1111
               x = 000a e000 0000 0ijk 0000 0000 uvwx zBDF

i = 3,        mk = 0000 1000 0000 0001 0000 0000 0000 0000
After PP,     mp = 0000 0111 1111 1111 0000 0000 0000 0000
              mv = 0000 0000 0000 0111 0000 0000 0000 0000
               m = 0001 1000 0000 0000 0000 0111 1111 1111
               x = 000a e000 0000 0000 0000 0ijk uvwx zBDF

i = 4,        mk = 0000 1000 0000 0000 0000 0000 0000 0000
After PP,     mp = 1111 1000 0000 0000 0000 0000 0000 0000
              mv = 0001 1000 0000 0000 0000 0000 0000 0000
               m = 0000 0000 0000 0000 0001 1111 1111 1111
               x = 0000 0000 0000 0000 000a eijk uvwx zBDF
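
The routine can be checked mechanically against a bit-at-a-time reference. The following is a minimal sketch, assuming 32-bit words; the names compress_pp and compress_ref are illustrative and do not come from the text, and the extra parentheses only make the operator precedence explicit.

```c
#include <stdint.h>

/* Parallel prefix compress, following Figure 7-6 (32-bit words assumed). */
uint32_t compress_pp(uint32_t x, uint32_t m) {
    uint32_t mk, mp, mv, t;
    int i;

    x = x & m;                      /* Clear irrelevant bits. */
    mk = ~m << 1;                   /* We will count 0's to the right. */
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);        /* Parallel prefix (XOR scan from the right). */
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;                        /* Bits to move in this round. */
        m = (m ^ mv) | (mv >> (1 << i));    /* Compress m. */
        t = x & mv;
        x = (x ^ t) | (t >> (1 << i));      /* Compress x. */
        mk = mk & ~mp;
    }
    return x;
}

/* Hypothetical reference: gather the selected bits one at a time. */
uint32_t compress_ref(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int i, k = 0;                   /* k = next free position in the result. */

    for (i = 0; i < 32; i++) {
        if ((m >> i) & 1) {
            r |= ((x >> i) & 1) << k;
            k++;
        }
    }
    return r;
}
```

With the mask of Figure 7-7 (0x88E00F55, which has 13 one-bits), compressing a word of all 1's should yield 0x00001FFF, the same value as the final m in the trace.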


We calculate that the algorithm of Figure 7-6 would execute in 169 instructions on a 64-bit basic RISC, as 
compared to 516 (worst case) for the algorithm of Figure 7-5. 


The number of instructions required by the algorithm of Figure 7-6 can be reduced substantially if the mask m 
is a constant. This can occur in two situations: (1) a call to "compress(x, m)" occurs in a loop, in which 
the value of m is not known but it is a loop constant, and (2) the value of m is known and the code for
compress is generated in advance, perhaps by a compiler. 


Notice that the value assigned to x in the loop in Figure 7-6 is not used in the loop for anything other than the
assignment to x. And, x is dependent only on itself and variable mv. Therefore, the subroutine can be coded
with all references to x deleted, and the five values computed for mv can be saved in variables mv0, mv1, ...,
mv4. Then, in situation (1) the function without references to x can be placed outside the loop in which
"compress(x, m)" occurs, and the following statements can be placed in the loop:


x = x & m;
t = x & mv0;  x = x ^ t | (t >> 1);
t = x & mv1;  x = x ^ t | (t >> 2);
t = x & mv2;  x = x ^ t | (t >> 4);
t = x & mv3;  x = x ^ t | (t >> 8);
t = x & mv4;  x = x ^ t | (t >> 16);


This is only 21 instructions in the loop (the loading of the constants can be placed outside the loop), a
considerable improvement over the 127 required by the full subroutine of Figure 7-6.
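
The split into a one-time preparation step and a cheap per-word step might be sketched as follows. The names compress_plan, compress_prepare, and compress_apply are hypothetical, introduced only for this illustration, and 32-bit words are assumed.

```c
#include <stdint.h>

/* Hypothetical container for the mask and the five move masks mv0..mv4. */
typedef struct {
    uint32_t m;
    uint32_t mv[5];
} compress_plan;

/* The loop of Figure 7-6 with all references to x deleted,
   saving the five values computed for mv. */
compress_plan compress_prepare(uint32_t m) {
    compress_plan p;
    uint32_t mk, mp, mv;
    int i;

    p.m = m;
    mk = ~m << 1;
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;
        p.mv[i] = mv;                       /* Save the mask for round i. */
        m = (m ^ mv) | (mv >> (1 << i));
        mk = mk & ~mp;
    }
    return p;
}

/* The statements that remain inside the loop over x. */
uint32_t compress_apply(const compress_plan *p, uint32_t x) {
    uint32_t t;

    x = x & p->m;
    t = x & p->mv[0];  x = (x ^ t) | (t >> 1);
    t = x & p->mv[1];  x = (x ^ t) | (t >> 2);
    t = x & p->mv[2];  x = (x ^ t) | (t >> 4);
    t = x & p->mv[3];  x = (x ^ t) | (t >> 8);
    t = x & p->mv[4];  x = (x ^ t) | (t >> 16);
    return x;
}
```

A caller would invoke compress_prepare once per mask and compress_apply once per word.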


In situation (2), in which the value of m is known, the same sort of thing can be done, and further optimization
may be possible. It might happen that one of the five masks is 0, in which case one of the five lines shown
above can be omitted. For example, mask mv0 is 0 if it happens that no bit moves an odd number of positions,
and mv4 is 0 if no bit moves more than 15 positions, and so on.


As an example, for 


m = 0101 0101 0101 0101 0101 0101 0101 0101, 


the calculated masks are 


mv0 = 0100 0100 0100 0100 0100 0100 0100 0100
mv1 = 0011 0000 0011 0000 0011 0000 0011 0000 
mv2 = 0000 1111 0000 0000 0000 1111 0000 0000 
mv3 = 0000 0000 1111 1111 0000 0000 0000 0000 
mv4 = 0000 0000 0000 0000 0000 0000 0000 0000 


Because the last mask is 0, in the compiled code situation this compression operation is done in 17 instructions 
(not counting the loading of the masks). This is not quite as good as the code shown for this operation on page 
108 (13 instructions, not counting the loading of masks), which takes advantage of the fact that alternate bits 
are being selected. 


Using Insert and Extract 


If your computer has the insert instruction, preferably with immediate values for the operands that identify the 
bit field in the target register, then in the compiled situation insert can often be used to do the compress 
operation with fewer instructions than the methods discussed above. Furthermore, it doesn't tie up registers 
holding the masks. 


The target register is initialized to 0, and then, for each contiguous group of 1's in the mask m, variable x is
shifted right to right-justify the next field, and the insert instruction is used to insert the bits of x in the
appropriate place in the target register. This does the operation in 2n + 1 instructions, where n is the number of
fields (groups of consecutive 1's) in the mask. The worst case is 33 instructions, because the maximum number
of fields is 16 (which occurs for alternating 1's and 0's).


An example in which the insert method uses substantially fewer instructions is m = 0x0010084A. Compressing
with this mask requires moving bits 1, 2, 4, 8, and 16 positions. Thus, it takes the full 21 instructions for the
parallel prefix method, but only 11 instructions for the insert method (there are five fields). A more extreme
case is m = 0x80000000. Here a single bit moves 31 positions, requiring 21 instructions for the parallel prefix
method, but only three instructions for the insert method and only one instruction (shift right 31) if you are not
constrained to any particular scheme.
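
The field-by-field idea can be modeled in portable C, which has no insert instruction. In this rough sketch each "insert" is emulated with a mask, a shift, and an OR, so it illustrates the data movement rather than the instruction count. The function names are hypothetical, and a compiler for a known mask would emit one shift and one insert per field instead of looping.

```c
#include <stdint.h>

/* Compress x under mask m by walking the fields (runs of 1's) of m
   from the right and appending each field to the result. */
uint32_t compress_fields(uint32_t x, uint32_t m) {
    uint32_t r = 0;     /* Target register, initialized to 0. */
    int pos = 0;        /* Current bit position in m and x. */
    int out = 0;        /* Next free bit position in the result. */

    while (pos < 32) {
        int len = 0;
        uint32_t fmask;

        if (((m >> pos) & 1) == 0) {        /* Skip 0's between fields. */
            pos++;
            continue;
        }
        while (pos + len < 32 && ((m >> (pos + len)) & 1))
            len++;                          /* Measure the field. */
        fmask = (len == 32) ? 0xFFFFFFFFu : ((1u << len) - 1);
        r |= ((x >> pos) & fmask) << out;   /* Right-justify, then "insert". */
        out += len;
        pos += len;
    }
    return r;
}

/* Number of fields n in m: a field starts at each 1-bit that has a 0
   (or the end of the word) immediately to its right. */
int field_count(uint32_t m) {
    uint32_t starts = m & ~(m << 1);
    int n = 0;

    while (starts != 0) {
        starts &= starts - 1;               /* Clear the lowest 1-bit. */
        n++;
    }
    return n;
}
```

For m = 0x0010084A, field_count gives 5, which corresponds to the 2n + 1 = 11 instructions quoted above.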


You can also use the extract instruction in various simple ways to do the compress operation with a known 
mask in 3n - 2 instructions, where n is the number of fields in the mask. 


Clearly, the problem of compiling optimal code for the compress operation with a known mask is a difficult 
one. 


Compress Left 


To compress bits to the left, obviously you can reverse the argument x and the mask, compress right, and
reverse the result. Another way is to compress right and then shift left by pop(~m), the number of 0-bits in the
mask. These might be satisfactory if your computer has an instruction for bit reversal or population count, but if
not, the algorithm of Figure 7-6 is easily adapted: Just reverse the direction of all the shifts except the two in the
expressions 1 << i (eight to change).
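
Applying that recipe gives something like the following sketch. It is a mechanical mirror image of the compress-right routine with the eight shifts reversed, written for 32-bit words; the name compress_left is an assumption of this example.

```c
#include <stdint.h>

/* Compress the bits of x selected by m to the left (high-order) end.
   Mirror image of Figure 7-6: every shift is reversed except 1 << i. */
uint32_t compress_left(uint32_t x, uint32_t m) {
    uint32_t mk, mp, mv, t;
    int i;

    x = x & m;                      /* Clear irrelevant bits. */
    mk = ~m >> 1;                   /* We will count 0's to the left. */
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk >> 1);        /* Parallel prefix, scanning from the left. */
        mp = mp ^ (mp >> 2);
        mp = mp ^ (mp >> 4);
        mp = mp ^ (mp >> 8);
        mp = mp ^ (mp >> 16);
        mv = mp & m;                        /* Bits to move. */
        m = (m ^ mv) | (mv << (1 << i));    /* Compress m to the left. */
        t = x & mv;
        x = (x ^ t) | (t << (1 << i));      /* Compress x to the left. */
        mk = mk & ~mp;
    }
    return x;
}
```

For the 13-bit mask 0x88E00F55, a word of all 1's compresses to 0xFFF80000, that is, thirteen 1's at the high-order end.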


7-5 General Permutations, Sheep and Goats Operation 


To do general permutations of the bits in a word, or of anything else, a central problem is how to represent the
permutation. It cannot be represented very compactly. Because there are 32! permutations of the bits in a 32-bit
word, at least $\lceil \log_2(32!) \rceil = 118$ bits, or three words plus 22 bits, are required to designate one
permutation out of the 32!.


One interesting way to represent permutations is closely related to the compression operations discussed in 
Section 7-4 [GLS1]. Start with the direct method of simply listing the bit position to which each bit moves. For 
example, for the permutation done by a rotate left of four bit positions, the bit at position 0 (the least significant 
bit) moves to position 4, 1 moves to 5, ..., 31 moves to 3. This permutation can be represented by the vector of 
32 5-bit indexes: 


00100
00101
...
11111
00000
00001
00010
00011


Treating that as a bit matrix, the representation we have in mind is its transpose, except reflected about the off 
diagonal so the top row contains the least significant bits and the result uses little-endian bit numbering. This 
we store as five 32-bit words in array p: 


p[0] = 1010 1010 1010 1010 1010 1010 1010 1010 
p[1] = 1100 1100 1100 1100 1100 1100 1100 1100 
p[2] = 0000 1111 0000 1111 0000 1111 0000 1111
p[3] = 0000 1111 1111 0000 0000 1111 1111 0000 
p[4] = 0000 1111 1111 1111 1111 0000 0000 0000 


Each bit of p[0] is the least significant bit of the position to which the corresponding bit of x moves, each bit
of p[1] is the next more significant bit, and so on. This is similar to the encoding of the masks denoted by mv
in the previous section, except that mv applies to revised masks in the compress algorithm, not to the original
mask.


The compression operation we need compresses to the left all bits marked with 1's in the mask, and compresses
to the right all bits marked with 0's. [1] This is sometimes called the "sheep and goats" operation (SAG), or
"generalized unshuffle." It can be calculated with


SAG(x, m) = compress_left(x, m) | compress(x, ~m).


[1] If big-endian bit numbering is used, compress to the left all bits marked with 0's, and to the right all bits marked with
1's.


With SAG as a fundamental operation, and a permutation p described as above, the bits of a word x can be
permuted by p in the following 15 steps:


x    = SAG(x, p[0]);
p[1] = SAG(p[1], p[0]);
p[2] = SAG(p[2], p[0]);
p[3] = SAG(p[3], p[0]);
p[4] = SAG(p[4], p[0]);

x    = SAG(x, p[1]);
p[2] = SAG(p[2], p[1]);
p[3] = SAG(p[3], p[1]);
p[4] = SAG(p[4], p[1]);

x    = SAG(x, p[2]);
p[3] = SAG(p[3], p[2]);
p[4] = SAG(p[4], p[2]);

x    = SAG(x, p[3]);
p[4] = SAG(p[4], p[3]);

x    = SAG(x, p[4]);


In these steps, SAG is used to perform a stable binary radix sort. Array p is used as 32 5-bit keys to sort the bits
of x. In the first step, all bits of x for which p[0] = 1 are moved to the left half of the resulting word, and all
those for which p[0] = 0 are moved to the right half. Other than this, the order of the bits is not changed (that
is, the sort is "stable"). Then all the keys that will be used for the next round of sorting are similarly sorted. The
sixth line is sorting x based on the second least significant bit of the key, and so on.
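
The radix-sort view can be tried out with a short sketch. Here sag is a deliberately simple bit-at-a-time stand-in for the SAG operation (not the compress-based formula), and permute_sag is a hypothetical name for the 15-step sequence; both assume 32-bit words and little-endian bit numbering.

```c
#include <stdint.h>

/* Bit-at-a-time stand-in for SAG: bits of x marked with 1 in m go to the
   left (high-order) end, bits marked with 0 go to the right, and the
   relative order within each group is preserved (a stable partition). */
uint32_t sag(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int i, goats = 0, sheep;

    for (i = 0; i < 32; i++)            /* Count the 0-bits of m. */
        goats += !((m >> i) & 1);
    sheep = goats;                      /* 1-marked bits land above the 0-marked ones. */
    goats = 0;                          /* 0-marked bits start at position 0. */
    for (i = 0; i < 32; i++) {
        uint32_t bit = (x >> i) & 1;
        if ((m >> i) & 1) {
            r |= bit << sheep;
            sheep++;
        } else {
            r |= bit << goats;
            goats++;
        }
    }
    return r;
}

/* The 15 SAG steps: sort the bits of x, and the remaining key words,
   on key bit 0, then key bit 1, and so on. Note that p[] is overwritten. */
uint32_t permute_sag(uint32_t x, uint32_t p[5]) {
    int j, k;

    for (j = 0; j < 5; j++) {
        x = sag(x, p[j]);
        for (k = j + 1; k < 5; k++)
            p[k] = sag(p[k], p[j]);
    }
    return x;
}
```

With the array p given earlier for a rotate left of four positions, permute_sag applied to 0x12345678 should return 0x23456781.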


Similarly to the situation of compressing, if a certain permutation p is to be used on a number of words x, then
a considerable savings results by precomputing most of the steps above. The permutation array is revised to


p[1] = SAG(p[1], p[0]);
p[2] = SAG(SAG(p[2], p[0]), p[1]);
p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);
p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);


and then each permutation is done with


x = SAG(x, p[0]);
x = SAG(x, p[1]);
x = SAG(x, p[2]);
x = SAG(x, p[3]);
x = SAG(x, p[4]);


A more direct (but perhaps less interesting) way to do general permutations of the bits in a word is to represent 
a permutation as a sequence of 32 5-bit indexes. 


The kth index is the bit number in the source from which the kth bit of the result comes. (This is a "comes 
from" list, whereas the SAG method uses a "goes to" list.) These could be packed six to a 32-bit word, thus 
requiring six words to hold all 32 bit indexes. An instruction can be implemented in hardware such as 


bitgather Rt,Rx,Ri, 


where register Rt is a target register (and also a source), register Rx contains the bits to be permuted, and
register Ri contains six 5-bit indexes (and two unused bits). The operation of the instruction is


$t \leftarrow (t \ll 6) \mid x_{i_0} x_{i_1} x_{i_2} x_{i_3} x_{i_4} x_{i_5}$


In words, the contents of the target register are shifted left six bit positions, and six bits are selected from word 
x and placed in the vacated six positions of t. The bits selected are given by the six 5-bit indexes in word i, 
taken in left-to-right order. The bit numbering in the indexes could be either little- or big-endian, and the 
operation would probably be as described for either type of machine. 


To permute a word, use a sequence of six such instructions, all with the same Rt and Rx, but different index
registers. In the first index register of the sequence, only indexes $i_4$ and $i_5$ are significant, as the bits selected by
the other four indexes are shifted out of the left end of Rt.


An implementation of this instruction would most likely allow index values to be repeated, so the instruction 
can be used to do more than permute bits. It can be used to repeat any selected bit any number of times in the 
target register. The SAG operation lacks this generality. 
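
A software model may make the instruction's behavior concrete. The sketch below is hypothetical: the text does not specify how the six indexes are packed, so it assumes they occupy the low-order 30 bits of the index word, with the first index in bits 29..25 and the sixth in bits 4..0, and it uses little-endian bit numbering. The function names are likewise illustrative.

```c
#include <stdint.h>

/* Model of bitgather Rt,Rx,Ri under the packing assumed above. */
uint32_t bitgather(uint32_t t, uint32_t x, uint32_t ri) {
    int k;

    for (k = 0; k < 6; k++) {
        unsigned idx = (ri >> (25 - 5 * k)) & 31;   /* Indexes in left-to-right order. */
        t = (t << 1) | ((x >> idx) & 1);            /* Six 1-bit shifts = shift left 6. */
    }
    return t;
}

/* Build six index words from a "comes from" list: from[k] is the source
   bit number for result bit k. The first four slots of the first word
   are shifted out of Rt, so they are filled with 0. */
void pack_indexes(const unsigned char from[32], uint32_t ri[6]) {
    int w, k;

    for (w = 0; w < 6; w++) {
        ri[w] = 0;
        for (k = 0; k < 6; k++) {
            int bit = 35 - (6 * w + k);             /* Result bit fed by this slot. */
            unsigned idx = (bit < 32) ? from[bit] : 0;
            ri[w] = (ri[w] << 5) | idx;
        }
    }
}

/* Permute a word with six bitgather steps sharing the same t and x. */
uint32_t permute_bitgather(uint32_t x, const uint32_t ri[6]) {
    uint32_t t = 0;
    int w;

    for (w = 0; w < 6; w++)
        t = bitgather(t, x, ri[w]);
    return t;
}
```

Because index values may repeat, the same model also replicates bits: an index word whose six indexes are all 31 copies the sign bit of x six times.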


It is not unduly difficult to implement this as a fast (e.g., one cycle) instruction. The bit selection circuit
consists of six 32:1 MUX's. If these are built from five stages of 2:1 MUX's in today's technology
($6 \cdot 31 = 186$ MUX's in all), the instruction would be faster than a 32-bit add instruction [MD].


Permuting bits has applications in cryptography, and the closely related operation of permuting subwords (e.g., 
permuting the bytes in a word) has applications in computer graphics. Both of these applications are more 
likely to deal with 64-bit words, or possibly with 128, than with 32. The SAG and bitgather methods apply with 
obvious changes to these larger word sizes. 


To encrypt or decrypt a message with the Data Encryption Standard (DES) algorithm requires a large number 
of permutation-like mappings. First, key generation is done, once per session. This involves 17 permutation- 
like mappings. The first, called "permuted choice 1," maps from a 64-bit quantity to a 56-bit quantity (it selects 
the 56 non-parity bits from the key and permutes them). This is followed by 16 permutation-like mappings 
from 56 bits to 48 bits, all using the same mapping, called "permuted choice 2." 


Following key generation, each block of 64 bits in the message is subjected to 34 permutation-like operations. 
The first and last operations are 64-bit permutations, one being the inverse of the other. There are 16 
permutations with repetitions that map 32-bit quantities to 48 bits, all using the same mapping. Finally, there 
are 16 32-bit permutations, all using the same permutation. The total number of distinct mappings is six. They 
are all constants and are given in [DES]. 


DES is obsolete, as it was proved to be insecure in 1998 by the Electronic Frontier Foundation, using special 
hardware. The National Institute of Standards and Technology (NIST) has endorsed a temporary replacement 
called Triple DES, which consists of DES run serially three times on each 64-bit block, each time with a 
different key (that is, the key length is 192 bits, including 24 parity bits). Hence it takes three times as many 
permutation operations as does DES to encrypt or decrypt. 


However, the "permanent" replacement for DES and Triple DES, the Advanced Encryption Standard 
(previously known as the Rijndael algorithm [AES]), involves no bit-level permutations. The closest it comes 
to a permutation is a simple rotation of 32-bit words by a multiple of 8-bit positions. Other encryption methods 
proposed or in use generally involve far fewer bit-level permutations than DES. 


To compare the two permutation methods discussed here, the bitgather method has the advantages of (1) 
simpler preparation of the index words from the raw data describing the permutation, (2) simpler hardware, and 
(3) more general mappings. The SAG method has the advantages of (1) doing the permutation in five rather 
than six instructions, (2) having only two source registers in its instruction format (which might fit better in 
some RISC architectures), (3) scaling better to permute a doubleword quantity, and (4) permuting subwords 
more efficiently. 


Item (3) is discussed in [LSY]. The SAG instruction allows for doing a general permutation of a two-word 
quantity with two executions of the SAG instruction, a few basic RISC instructions, and two full permutations 
of single words. The bitgather instruction allows for doing it by executing three full permutations of single 
words plus a few basic RISC instructions. This does not count preprocessing of the permutation to produce new 
quantities that depend only on the permutation. We leave it to the reader to discover these methods. 


Regarding item (4), to permute, for example, the four bytes of a word with bitgather requires executing six
instructions, the same as for a general bit permutation by bitgather. But with SAG it can be done in only two
instructions, rather than the five required for a general bit permutation by SAG. The gain in efficiency applies
even when the subwords are not a power of 2 in size; the number of steps required is $\lceil \log_2 n \rceil$, where n is the
number of subwords, not counting a possible non-participating group of bits that stays at one end or the other.


[LSY] discusses the SAG and bitgather instructions (called "GRP" and "PPERM," respectively), other possible 
permutation instructions based on networks, and permuting by table lookup. 


7-6 Rearrangements and Index Transformations 


Many simple rearrangements of the bits in a computer word correspond to even simpler transformations of the 
coordinates, or indexes, of the bits [GLS1]. These correspondences apply to rearrangements of the elements of 


any one-dimensional array, provided the number of array elements is an integral power of 2. For programming 
purposes, they are useful primarily when the array elements are a computer word or larger in size. 


As an example, the outer perfect shuffle of the elements of an array A of size eight, with the result in array B,
consists of the following moves:


$A_0 \to B_0;\quad A_1 \to B_2;\quad A_2 \to B_4;\quad A_3 \to B_6;$

$A_4 \to B_1;\quad A_5 \to B_3;\quad A_6 \to B_5;\quad A_7 \to B_7.$


Each B-index is the corresponding A-index rotated left one position, using a 3-bit rotator. The outer perfect
unshuffle is of course accomplished by rotating right each index. Some similar correspondences are shown in
Table 7-1. Here n is the number of array elements, "lsb" means least significant bit, and the rotations of indexes
are done with a $\log_2 n$-bit rotator.


Table 7-1. Rearrangements and Index Transformations


| Rearrangement | Index transformation: array index, or big-endian bit numbering | Index transformation: little-endian bit numbering |
|---|---|---|
| Reversal | Complement | Complement |
| Bit flip, or generalized reversal (page 102) | Exclusive or with a constant | Exclusive or with a constant |
| Rotate left k positions | Subtract k (mod n) | Add k (mod n) |
| Rotate right k positions | Add k (mod n) | Subtract k (mod n) |
| Outer perfect shuffle | Rotate left one position | Rotate right one position |
| Outer perfect unshuffle | Rotate right one position | Rotate left one position |
| Inner perfect shuffle | Rotate left one, then complement lsb | Complement lsb, then rotate right |
| Inner perfect unshuffle | Complement lsb, then rotate right | Rotate left one, then complement lsb |
| Transpose of an 8x8-bit matrix held in a 64-bit word | Rotate (left or right) three positions | Rotate (left or right) three positions |
| FFT unscramble | Reverse bits | Reverse bits |
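

The correspondence between a rearrangement and an index transformation can be illustrated on an ordinary array. This sketch covers only the outer perfect shuffle with array (big-endian style) indexing; the helper names rotl_index and outer_shuffle are hypothetical, and the array length is assumed to be a power of 2 between 2 and 2**31.

```c
/* Rotate the low-order `bits` bits of index i left by one position.
   Assumes 1 <= bits <= 31 and i < 2**bits. */
unsigned rotl_index(unsigned i, int bits) {
    unsigned mask = (1u << bits) - 1;

    return ((i << 1) | (i >> (bits - 1))) & mask;
}

/* Outer perfect shuffle of an array of n = 2**bits elements:
   element a[i] moves to b[j], where j is i rotated left one position. */
void outer_shuffle(const int *a, int *b, int bits) {
    unsigned n = 1u << bits;
    unsigned i;

    for (i = 0; i < n; i++)
        b[rotl_index(i, bits)] = a[i];
}
```

For an eight-element array holding 0 through 7, the result is 0, 4, 1, 5, 2, 6, 3, 7: the two halves interleaved with the original first element still first.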


Chapter 8. Multiplication 


Multiword Multiplication 
High-Order Half of 64-Bit Product 
High-Order Product Signed from/to Unsigned 


Multiplication by Constants 


8-1 Multiword Multiplication 


This may be done with, basically, the traditional grade-school method. Rather than develop an array of partial 
products, however, it is more efficient to add each new row, as it is being computed, into a row that will 
become the product. 


If the multiplicand is m words, and the multiplier is n words, then the product occupies m + n words (or fewer), 
whether signed or unsigned. 


In applying the grade-school scheme, we would like to treat each 32-bit word as a single digit. This works out 
well if an instruction that gives the 64-bit product of two 32-bit integers is available. Unfortunately, even if the 
machine has such an instruction, it is not readily accessible from most high-level languages. In fact, many 
modern RISC machines do not have this instruction in part because it isn't accessible from high-level languages 
and thus would not often be used. (Another reason is that the instruction would be one of a very few that give a 
two-register result.) 


Our procedure is shown in Figure 8-1. It uses halfwords as the "digits." Parameter w gets the result, and u and
v are the multiplier and multiplicand, respectively. Each is an array of halfwords, with the first halfword
(w[0], u[0], and v[0]) being the least significant digit. This is "little-endian" order. Parameters m and n are
the number of halfwords in u and v, respectively.


The picture below may help in understanding. There is no relation between m and n; either may be the larger.


                        u_{m-1}  u_{m-2}  ...  u_1  u_0
                   x             v_{n-1}  ...  v_1  v_0
   w_{m+n-1}  w_{m+n-2}  ...                   w_1  w_0


The procedure follows Algorithm M of [Knu2, sec. 4.3.1], but is coded in C and modified to perform signed
multiplication. Observe that the assignment to t in the upper half of Figure 8-1 cannot overflow, because the
maximum value that could be assigned to t is $(2^{16} - 1)^2 + 2(2^{16} - 1) = 2^{32} - 1$.


Multiword multiplication is simplest for unsigned operands. In fact, the code of Figure 8-1 performs unsigned 
multiplication if the "correction" steps (the lines between the three-line comment and the "return" statement) 
are omitted. An unsigned version can be extended to signed in three ways: 


Figure 8-1 Multiword integer multiplication, signed.


void mulmns(unsigned short w[], unsigned short u[],
   unsigned short v[], int m, int n) {

   unsigned int k, t, b;
   int i, j;

   for (i = 0; i < m; i++)
      w[i] = 0;

   for (j = 0; j < n; j++) {
      k = 0;
      for (i = 0; i < m; i++) {
         t = u[i]*v[j] + w[i + j] + k;
         w[i + j] = t;          // (I.e., t & 0xFFFF).
         k = t >> 16;
      }
      w[j + m] = k;
   }

   // Now w[] has the unsigned product. Correct by
   // subtracting v*2**16m if u < 0, and
   // subtracting u*2**16n if v < 0.

   if ((short)u[m - 1] < 0) {
      b = 0;                    // Initialize borrow.
      for (j = 0; j < n; j++) {
         t = w[j + m] - v[j] - b;
         w[j + m] = t;
         b = t >> 31;
      }
   }
   if ((short)v[n - 1] < 0) {
      b = 0;
      for (i = 0; i < m; i++) {
         t = w[i + n] - u[i] - b;
         w[i + n] = t;
         b = t >> 31;
      }
   }
   return;
}


1. Take the absolute value of each input operand, perform unsigned multiplication, and then negate
the result if the input operands had different signs.


2. Perform the multiplication using unsigned elementary multiplication except when multiplying one 
of the high-order halfwords, in which case use signed x unsigned or signed x signed multiplication. 


3. Perform unsigned multiplication and then correct the result somehow. 


The first method requires passing over as many as m + n input halfwords, to compute their absolute value. Or,
if one operand is positive and one is negative, the method requires passing over as many as max(m, n) + m + n
halfwords, to complement the negative input operand and the result. Perhaps more serious, the algorithm would 
alter its inputs (which we assume are passed by address), which may be unacceptable in some applications. 
Alternatively, it could allocate temporary space for them, or it could alter them and later change them back. All 
these alternatives are unappealing. 


The second method requires three kinds of elementary multiplication (unsigned x unsigned, unsigned x signed, 
and signed x signed) and requires sign extension of partial products on the left, with 0's or 1's, making each 
partial product take longer to compute and add to the running total. 


We choose the third method. To see how it works, let u and v denote the values of the two signed integers being
multiplied, and let them be of lengths M and N bits, respectively. Then the steps in the upper half of Figure 8-1
erroneously interpret u as an unsigned quantity, having value $u + 2^M u_{M-1}$, where $u_{M-1}$ is the sign bit of u.
That is, $u_{M-1} = 1$ if u is negative, and $u_{M-1} = 0$ otherwise. Similarly, the program interprets v as having value
$v + 2^N v_{N-1}$.

The program computes the product of these unsigned numbers; that is, it computes

$(u + 2^M u_{M-1})(v + 2^N v_{N-1}) = uv + 2^M u_{M-1} v + 2^N v_{N-1} u + 2^{M+N} u_{M-1} v_{N-1}.$

To get the desired result (uv), we must subtract from the unsigned product the value $2^M u_{M-1} v + 2^N v_{N-1} u$.
There is no need to subtract the term $2^{M+N} u_{M-1} v_{N-1}$, because we know that the result can be expressed in
M + N bits, so there is no need to compute any product bits more significant than bit position M + N - 1. These two
subtractions are performed by the steps below the three-line comment in Figure 8-1. They require passing over
a maximum of m + n halfwords.
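
The correction can be seen in miniature for single halfwords, that is, with M = N = 16 and all arithmetic done modulo 2**32. The function below is a small sketch with a hypothetical name; it is not part of Figure 8-1.

```c
#include <stdint.h>

/* Signed 16 x 16 => 32-bit product built from an unsigned multiply.
   u and v hold the two's-complement bit patterns of the signed operands;
   the return value is the bit pattern of the signed product. */
uint32_t mul_signed_via_unsigned(uint16_t u, uint16_t v) {
    uint32_t p = (uint32_t)u * v;               /* Unsigned product. */

    if (u & 0x8000)
        p -= (uint32_t)v << 16;                 /* Subtract 2**16 * v if u < 0. */
    if (v & 0x8000)
        p -= (uint32_t)u << 16;                 /* Subtract 2**16 * u if v < 0. */
    return p;                                   /* The 2**32 term vanishes mod 2**32. */
}
```

For example, 0xFFFF times 0xFFFF (that is, -1 times -1) gives the unsigned product 0xFFFE0001, and the two subtractions of 0xFFFF0000 bring it to 1.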


It might be tempting to use the program of Figure 8-1 by passing it an array of fullword integers—that is, by 
"lying across the interface." Such a program will work on a little-endian machine, but not on a big-endian one. 
If we had stored the arrays in the reverse order, with u[0] being the most significant halfword (and the
program altered accordingly), the "lying" program would work on a big-endian machine, but not on a
little-endian one.


8-2 High-Order Half of 64-Bit Product 


Here we consider the problem of computing the high-order 32 bits of the product of two 32-bit integers. This is
the function of our basic RISC instructions multiply high signed (mulhs) and multiply high unsigned (mulhu).


For unsigned multiplication, the algorithm in the upper half of Figure 8-1 works well. Rewrite it for the special 
case m = n = 2, with loops unrolled, obvious simplifications made, and the parameters changed to 32-bit 
unsigned integers. 


For signed multiplication, it is not necessary to code the "correction steps" in the lower half of Figure 8-1. 
These can be omitted if proper attention is paid to whether the intermediate results are signed or unsigned 
(declaring them to be signed causes the right shifts to be sign-propagating shifts). The resulting algorithm is 
shown in Figure 8-2. For an unsigned version, simply change all the int declarations to unsigned. 


Figure 8-2 Multiply high signed.


int mulhs(int u, int v) {
   unsigned u0, v0, w0;
   int u1, v1, w1, w2, t;

   u0 = u & 0xFFFF;  u1 = u >> 16;
   v0 = v & 0xFFFF;  v1 = v >> 16;
   w0 = u0*v0;
   t  = u1*v0 + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;
   w1 = u0*v1 + w1;
   return u1*v1 + w2 + (w1 >> 16);
}


The algorithm requires 16 basic RISC instructions in either the signed or unsigned version, four of which are 
multiplications. 
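
For the unsigned case, the same computation with every variable declared unsigned might look as follows. This is a sketch using fixed-width types; mulhu_ref is a hypothetical checking aid that assumes a 64-bit integer type is available, which the halfword method itself does not need.

```c
#include <stdint.h>

/* Unsigned counterpart of Figure 8-2: all intermediates are unsigned,
   so every right shift is a logical shift. */
uint32_t mulhu(uint32_t u, uint32_t v) {
    uint32_t u0, v0, w0;
    uint32_t u1, v1, w1, w2, t;

    u0 = u & 0xFFFF;  u1 = u >> 16;
    v0 = v & 0xFFFF;  v1 = v >> 16;
    w0 = u0*v0;
    t  = u1*v0 + (w0 >> 16);    /* Cannot overflow 32 bits. */
    w1 = t & 0xFFFF;
    w2 = t >> 16;
    w1 = u0*v1 + w1;            /* Cannot overflow 32 bits. */
    return u1*v1 + w2 + (w1 >> 16);
}

/* Reference using a 64-bit product, for checking only. */
uint32_t mulhu_ref(uint32_t u, uint32_t v) {
    return (uint32_t)(((uint64_t)u * v) >> 32);
}
```

The largest case, 0xFFFFFFFF squared, has high-order half 0xFFFFFFFE.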


8-3 High-Order Product Signed from/to Unsigned 


Assume that the machine can readily compute the high-order half of the 64-bit product of two unsigned 32-bit 
integers, but we wish to perform the corresponding operation on signed integers. We could use the procedure of 
Figure 8-2, but that requires four multiplications; the procedure to be given [BGN] is much more efficient than 
that. 


The analysis is a special case of that done to convert Knuth's Algorithm M from an unsigned to a signed
multiplication routine (Figure 8-1). Let x and y denote the two 32-bit signed integers that we wish to multiply
together. The machine will interpret x as an unsigned integer, having the value $x + 2^{32}x_{31}$, where $x_{31}$ is the most
significant bit of x (that is, $x_{31}$ is the integer 1 if x is negative, and 0 otherwise). Similarly, y under unsigned
interpretation has the value $y + 2^{32}y_{31}$.

Although the result we want is the high-order 32 bits of xy, the machine computes

$(x + 2^{32}x_{31})(y + 2^{32}y_{31}) = xy + 2^{32}(x_{31}y + y_{31}x) + 2^{64}x_{31}y_{31}.$

To get the desired result, we must subtract from this the quantity $2^{32}(x_{31}y + y_{31}x) + 2^{64}x_{31}y_{31}$. Because we
know that the result can be expressed in 64 bits, we can perform the arithmetic modulo $2^{64}$. This means that we
can safely ignore the last term, and compute the signed high-order product as shown below (seven basic RISC
instructions).


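Equation 1

$p \leftarrow \mathrm{mulhu}(x, y)$                      // multiply high unsigned instruction.
$t_1 \leftarrow (x \overset{s}{\gg} 31) \,\&\, y$        // $t_1 = x_{31}y$.
$t_2 \leftarrow (y \overset{s}{\gg} 31) \,\&\, x$        // $t_2 = y_{31}x$.
$p \leftarrow p - t_1 - t_2$                             // p = desired result.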


Unsigned from Signed 


The reverse transformation follows easily. The resulting program is the same as (1) except with the first
instruction changed to multiply high signed and the last operation changed to $p \leftarrow p + t_1 + t_2$.
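
Both conversions can be written out in C as a rough sketch. The functions mulhu_insn and mulhs_insn are hypothetical stand-ins for the machine's multiply-high instructions, modeled here with 64-bit arithmetic. The code also assumes two's-complement conversions and a sign-propagating right shift of negative values, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Stand-ins for the hardware instructions (assumptions of this sketch). */
static uint32_t mulhu_insn(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

static int32_t mulhs_insn(int32_t x, int32_t y) {
    return (int32_t)(((int64_t)x * y) >> 32);   /* Arithmetic shift assumed. */
}

/* Signed high-order product from the unsigned one, as in equation (1). */
int32_t mulhs_from_mulhu(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t p  = mulhu_insn(ux, uy);
    uint32_t t1 = (x < 0) ? uy : 0;     /* Same value as (x >>s 31) & y. */
    uint32_t t2 = (y < 0) ? ux : 0;     /* Same value as (y >>s 31) & x. */

    return (int32_t)(p - t1 - t2);
}

/* The reverse transformation: unsigned high-order product from the signed one. */
uint32_t mulhu_from_mulhs(uint32_t x, uint32_t y) {
    uint32_t p  = (uint32_t)mulhs_insn((int32_t)x, (int32_t)y);
    uint32_t t1 = ((int32_t)x < 0) ? y : 0;
    uint32_t t2 = ((int32_t)y < 0) ? x : 0;

    return p + t1 + t2;
}
```

The conditional expressions are written for readability; on the basic RISC each one corresponds to a shift right signed by 31 followed by an and.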


8-4 Multiplication by Constants 


It is nearly a triviality that one can multiply by a constant with a sequence of shift left and add instructions. For 
example, to multiply x by 13 (binary 1101), one can code 


t1 ← x << 2
t2 ← x << 3
r ← t1 + t2 + x


where r gets the result. 


In this section, left shifts are denoted by multiplication by a power of 2, so the above plan is written
r ← 8x + 4x + x, which is intended to show four instructions on the basic RISC and most machines.


What we want to convey here is that there is more to this subject than meets the eye. First of all, there are other 
considerations besides simply the number of shift's and add's required to do a multiplication by a given 
constant. To illustrate, below are two plans for multiplying by 45 (binary 101101). 


t ← 4x            t1 ← 4x
r ← x + t         t2 ← 8x
t ← 2t            t3 ← 32x
r ← r + t         r ← t1 + x
t ← 4t            t2 ← t2 + t3
r ← r + t         r ← r + t2


The plan on the left uses a variable t that holds x shifted left by a number of positions that corresponds to a 1- 
bit in the multiplier. Each shifted value is obtained from the one before it. This plan has these advantages: 


• It requires only one working register other than the input x and the output r.


• Except for the first two, it uses only 2-address instructions.


• The shift amounts are relatively small.

The same properties are retained when the plan is applied to any multiplier.


The scheme on the right does all the shift's first, with x as the operand. It has the advantage of increased 
parallelism. On a machine with sufficient instruction-level parallelism, the scheme on the right executes in three 
cycles, whereas the scheme on the left, running on a machine with unlimited parallelism, requires four. 
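
Written out in C, with one statement per instruction, the two plans for multiplying by 45 look like this. The function names are illustrative only, and the arithmetic is modulo 2**32.

```c
#include <stdint.h>

/* Left-hand plan: one working register t; each shifted value is
   obtained from the one before it. */
uint32_t mul45_serial(uint32_t x) {
    uint32_t t, r;

    t = x << 2;         /* t = 4x  */
    r = x + t;          /* r = 5x  */
    t = t << 1;         /* t = 8x  */
    r = r + t;          /* r = 13x */
    t = t << 2;         /* t = 32x */
    r = r + t;          /* r = 45x */
    return r;
}

/* Right-hand plan: all three shifts come first and depend only on x,
   so they can issue in the same cycle on a sufficiently wide machine. */
uint32_t mul45_parallel(uint32_t x) {
    uint32_t t1, t2, t3, r;

    t1 = x << 2;        /* 4x  */
    t2 = x << 3;        /* 8x  */
    t3 = x << 5;        /* 32x */
    r  = t1 + x;        /* 5x  */
    t2 = t2 + t3;       /* 40x */
    r  = r + t2;        /* 45x */
    return r;
}
```

Both use six instructions; they differ only in the dependences between them.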


In addition to these details, it is nontrivial to find the minimum number of operations to accomplish 
multiplication by a constant, where by an "operation" we mean an instruction from a typical computer's set of 
add and shift instructions. In what follows, we assume this set consists of add, subtract, shift left by any 
constant amount, and negate. We assume the instruction format is three-address. However, the problem is no 
easier if one is restricted to only add (adding a number to itself, and then adding the sum to itself, and so on, 
accomplishes a shift left of any amount), or if one augments the set by instructions such as the HP PA-RISC's 
shift and add instructions. (These shift the contents of a register left by one, two, or three positions, add it to a 
second register, and put the result in a third register. Thus, it can multiply by 3, 5, or 9 in a single, presumably 
fast, instruction.) We also assume that only the least significant 32 bits of the product are wanted. 


The first improvement to the basic binary decomposition scheme suggested above is to use subtract to shorten 
the sequence when the multiplier contains a group of three or more consecutive 1-bits. For example, to multiply 
by 28 (binary 11100), we can compute 32x - 4x (three instructions) rather than 16x + 8x + 4x (five 
instructions). On two's-complement machines, the result is correct even if the intermediate result of 32x 
overflows and the final result does not. 


To multiply by a constant m with the basic binary decomposition scheme (using only shift's and add's) requires


$2\,\mathrm{pop}(m) - 1 - \delta$


instructions, where $\delta = 1$ if m ends in a 1-bit (is odd), and $\delta = 0$ otherwise. If subtract is also used, it requires


$4g(m) + 2s(m) - 1 - \delta$


instructions, where g(m) is the number of groups of two or more consecutive 1-bits in m, s(m) is the number of
"singleton" 1-bits in m, and $\delta$ has the same meaning as before.


For a group of size 2, it makes no difference which method is used. 
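
The two instruction-count formulas are easy to evaluate mechanically. The sketch below uses the hypothetical names cost_binary and cost_groups and assumes m is at least 1.

```c
#include <stdint.h>

/* 2*pop(m) - 1 - delta: shifts and adds only. Assumes m >= 1. */
int cost_binary(uint32_t m) {
    int pop = 0;
    uint32_t t;

    for (t = m; t != 0; t &= t - 1)     /* Clear one 1-bit per iteration. */
        pop++;
    return 2*pop - 1 - (int)(m & 1);
}

/* 4*g(m) + 2*s(m) - 1 - delta: subtract is also allowed. Assumes m >= 1. */
int cost_groups(uint32_t m) {
    int g = 0, s = 0;
    uint32_t t = m;

    while (t != 0) {
        int run = 0;

        while ((t & 1) == 0)            /* Skip 0-bits. */
            t >>= 1;
        while (t & 1) {                 /* Measure a run of 1-bits. */
            run++;
            t >>= 1;
        }
        if (run == 1)
            s++;                        /* Singleton 1-bit. */
        else
            g++;                        /* Group of two or more 1-bits. */
    }
    return 4*g + 2*s - 1 - (int)(m & 1);
}
```

For m = 28 these give 5 and 3, matching the counts for 16x + 8x + 4x and 32x - 4x, and for m = 13 both give 4, consistent with the remark about groups of size 2.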


The next improvement is to treat specially groups that are separated by a single 0-bit. For example, consider
m = 55 (binary 110111). The group method calculates this as (64x - 16x) + (8x - x), which requires six
instructions. Calculating it as 64x - 8x - x, however, requires only four. Similarly, we can multiply by binary
110111011 as illustrated by the formula 512x - 64x - 4x - x (six instructions).


The formulas above give an upper bound on the number of operations required to multiply a variable x by any
given number m. Another bound can be obtained based on the size of m in bits, that is, on


$n = \lfloor \log_2 m \rfloor + 1.$


Theorem. Multiplication of a variable x by an n-bit constant m, $m \geq 1$, can be accomplished with at most n
instructions of the type add, subtract, and shift left by any given amount.


Proof. (Induction on n.) Multiplication by 1 can be done in 0 instructions, so the theorem holds for n = 1. For n 
> 1, if m ends in a 0-bit, then multiplication by m can be accomplished by multiplying by the number consisting 
of the left n - 1 bits of m (that is, by m/2), in n - 1 instructions, followed by a shift left of the result by one 
position. This uses n instructions altogether. 


If m ends in binary 01, then mx can be calculated by multiplying x by the number consisting of the left n - 2 bits 
of m, in n - 2 instructions, followed by a left shift of the result by 2, and an add of x. This requires n instructions 
altogether. 


If m ends in binary 11, then consider the cases in which it ends in 0011, 0111, 1011, and 1111. Let t be the 
result of multiplying x by the left n - 4 bits of m. If m ends in 0011, then mx = 16t + 2x + x, which requires (n - 
4) + 4 = n instructions. If m ends in 0111, then mx = 16t + 8x - x, which requires n instructions. If m ends in 
1111, then mx = 16t + 16x - x, which requires n instructions. The remaining case is that m ends in 1011. 


It is easy to show that mx can be calculated in n instructions if m ends in 001011, 011011, or 111011. The 
remaining case is 101011. 


This reasoning can be continued, with the "remaining case" always being of the form 101010...10101011.
Eventually, the size of m will be reached, and the only remaining case is the number 101010...10101011. This
n-bit number contains n/2 + 1 1-bits. By a previous observation, it can multiply x with 2(n/2 + 1) - 2 = n
instructions.


Thus, in particular, on a 32-bit machine multiplication by any constant can be done in at most 32 instructions, 
by the method described above. By inspection, it is easily seen that for n even, the n-bit number 101010... 
101011 requires n instructions, and for n odd, the n-bit number 1010101...010110 requires n instructions, so 
the bound is tight. 


The methodology described so far is not too hard to work out by hand, or to incorporate into an algorithm such 
as might be used in a compiler. But such an algorithm would not always produce the best code, because further 
improvement is sometimes possible. This can result from factoring the multiplier m or some intermediate 
quantity along the way of computing mx. For example, consider again m = 45 (binary 101101). The methods
described above require six instructions. Factoring 45 as 5 · 9, however, gives a four-instruction solution:


   t ← 4x + x
   r ← 8t + t


Factoring may be combined with additive methods. For example, multiplication by 106 (binary 1101010)
requires seven instructions by the additive methods, but writing it as 7 · 15 + 1 leads to a five-instruction
solution.


With factoring, the maximum number of instructions needed to multiply by an n-bit constant is, to the writer's
knowledge, an open problem. For large n it may be less than the bound of n proved above. For example, m =
0xAAAAAAAB requires 32 instructions without factoring, but writing this value as 2 · 5 · 17 · 257 · 65537 + 1
gives a ten-instruction solution. (Ten instructions, however, is probably not typical of large numbers. The
factorization reflects the simple bit pattern of alternate 1's and 0's.)
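

The three factorings mentioned above (5 · 9, 7 · 15 + 1, and 2 · 5 · 17 · 257 · 65537 + 1) can be written out in C as a sketch. The function names are ours, and unsigned arithmetic is used so that only the low-order 32 bits of each product are kept, as assumed throughout this discussion.

```c
#include <stdint.h>

/* 45x = 9 * (5x): four instructions. */
uint32_t mul45(uint32_t x) {
    uint32_t t = (x << 2) + x;      /* t = 5x        */
    return (t << 3) + t;            /* r = 8t + t    */
}

/* 106x = 15 * (7x) + x: five instructions. */
uint32_t mul106(uint32_t x) {
    uint32_t t = (x << 3) - x;      /* t = 7x        */
    t = (t << 4) - t;               /* t = 105x      */
    return t + x;                   /* 106x          */
}

/* 0xAAAAAAAB * x = 2 * 5 * 17 * 257 * 65537 * x + x: ten instructions. */
uint32_t mul_aaaaaaab(uint32_t x) {
    uint32_t t = (x << 2) + x;      /* 5x                     */
    t = (t << 4) + t;               /* times 17               */
    t = (t << 8) + t;               /* times 257              */
    t = (t << 16) + t;              /* times 65537            */
    return (t << 1) + x;            /* times 2, then add x    */
}
```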


This should give an idea of the combinatorics involved in this seemingly simple problem. Knuth [Knu2, sec.
4.6.3] discusses the closely related problem of computing $a^m$ using a minimum number of multiplications. This
is analogous to the problem of multiplying by m using only addition instructions. A compiler algorithm for
computing mx is described in [Bern].


Chapter 9. Integer Division 


Preliminaries 
Multiword Division 
Unsigned Short Division from Signed Division 


Unsigned Long Division 


9-1 Preliminaries 


This chapter and the following one give a number of tricks and algorithms involving "computer division" of
integers. In mathematical formulas we use the expression x/y to denote ordinary rational division, $x \div y$ to
denote signed computer division of integers (truncating toward 0), and $x \overset{u}{\div} y$ to denote unsigned computer
division of integers. Within C code, x/y of course denotes computer division, unsigned if either operand is
unsigned, and signed if both operands are signed.


Division is a complex process, and the algorithms involving it are often not very elegant. It is even a matter of 
judgment as to just how signed integer division should be defined. Most high-level languages and most 
computer instruction sets define the result to be the rational result truncated toward 0. This and two other 
possibilities are illustrated below. 


                    truncating     modulus      floor
    7 ÷ 3      =     2 rem  1     2 rem 1     2 rem  1
 (-7) ÷ 3      =    -2 rem -1    -3 rem 2    -3 rem  2
    7 ÷ (-3)   =    -2 rem  1    -2 rem 1    -3 rem -2
 (-7) ÷ (-3)   =     2 rem -1     3 rem 2     2 rem -1


The relation dividend = quotient * divisor + remainder holds for all three possibilities. We define "modulus" [1]
division by requiring that the remainder be nonnegative. We define "floor" division by requiring that the
quotient be the "floor" of the rational result. For positive divisors, modulus and floor division are equivalent. A
fourth possibility, seldom used, rounds the quotient to the nearest integer.
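

The three definitions can be compared directly in code. The sketch below builds floor and modulus division on top of C's own truncating `/` and `%` (C99 and later); the type `divresult` and the function names are ours, and the overflow case of dividing the maximum negative number by -1 is assumed not to occur.

```c
#include <assert.h>

typedef struct { int quot; int rem; } divresult;

/* Truncating division: what C's / and % compute. */
divresult div_trunc(int n, int d) {
    assert(d != 0);
    divresult r = { n / d, n % d };
    return r;
}

/* Floor division: the quotient is the floor of n/d, so a nonzero
   remainder takes the sign of the divisor. */
divresult div_floor(int n, int d) {
    divresult r = div_trunc(n, d);
    if (r.rem != 0 && ((r.rem < 0) != (d < 0))) { r.quot -= 1; r.rem += d; }
    return r;
}

/* Modulus division: the remainder is always nonnegative. */
divresult div_modulus(int n, int d) {
    divresult r = div_trunc(n, d);
    if (r.rem < 0) {
        if (d > 0) { r.quot -= 1; r.rem += d; }
        else       { r.quot += 1; r.rem -= d; }
    }
    return r;
}
```

Applied to the four sign combinations of 7 and 3, these functions reproduce the rows of the table above.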


[1] I know I will be taken to task for this nomenclature, because there is no universal agreement that "modulus" implies
"nonnegative." Knuth's "mod" operator [Knu1] is the remainder of floor division, which is negative (or 0) if the divisor is
negative. Several programming languages use "mod" for the remainder of truncating division. However, in
mathematics "modulus" is sometimes used for the magnitude of a complex number (nonnegative), and in congruence
theory the modulus is generally assumed to be positive.


One advantage of modulus and floor division is that most of the tricks simplify. For example, division by $2^n$
can be replaced by a shift right signed of n positions, and the remainder of dividing x by $2^n$ is given by the
logical and of x and $2^n - 1$. I suspect that modulus and floor division more often give the result you want. For
example, suppose you are writing a program to graph an integer-valued function, and the values range from
imin to imax. You want to set up the extremes of the ordinate to be the smallest multiples of 10 that include
imin and imax. Then the extreme values are simply $(\mathit{imin} \div 10) * 10$ and $((\mathit{imax} + 9) \div 10) * 10$ if modulus or
floor division is used. If conventional division is used, you must evaluate something like:


   if (imin >= 0) gmin = (imin/10)*10;
   else           gmin = ((imin - 9)/10)*10;
   if (imax >= 0) gmax = ((imax + 9)/10)*10;
   else           gmax = (imax/10)*10;
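

Wrapped as functions, the four assignments above can be checked on a few values. This is only a sketch; the names `gmin_for` and `gmax_for` are ours.

```c
/* Largest multiple of 10 that is <= imin, using truncating division. */
int gmin_for(int imin) {
    return (imin >= 0) ? (imin / 10) * 10 : ((imin - 9) / 10) * 10;
}

/* Smallest multiple of 10 that is >= imax, using truncating division. */
int gmax_for(int imax) {
    return (imax >= 0) ? ((imax + 9) / 10) * 10 : (imax / 10) * 10;
}
```

For a function ranging from -13 to 13, the ordinate then runs from -20 to 20.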


Besides the quotient being more useful with modulus or floor division than with truncating division, we 
speculate that the nonnegative remainder is probably wanted more often than a remainder that can be negative. 


It is hard to choose between modulus and floor division, because they differ only when the divisor is negative,
which is unusual. Appealing to existing high-level languages does not help, because they almost universally use
truncating division for x/y when the operands are signed integers. A few give floating-point numbers, or
rational numbers, for the result. Looking at remainders, there is confusion. In Fortran 90, the MOD function
gives the remainder of truncating division and MODULO gives the remainder of floor division (which can be
negative). Similarly, in Common Lisp and Ada, REM is the remainder of truncating division, and MOD is the
remainder of floor division. In PL/I, MOD is always nonnegative (it is the remainder of modulus division). In
Pascal, A mod B is defined only for B > 0, and then it is the nonnegative value (the remainder of either
modulus or floor division).


Anyway, we cannot change the world even if we knew how we wanted to change it, [2] so in what follows we
will use the usual definition (truncating) for $x \div y$.


[2] Some do try. IBM's PL.8 language uses modulus division, and Knuth's MMIX machine's division instruction uses 
floor division [MMIX]. 


A nice property of truncating division is that it satisfies 


$$(-n) \div d = n \div (-d) = -(n \div d), \quad \text{for } d \ne 0.$$


However, care must be exercised when applying this to transform programs, because if n or d is the maximum
negative number, -n or -d cannot be represented in 32 bits. The operation $(-2^{31}) \div (-1)$ is an overflow (the result
cannot be expressed as a signed quantity in two's-complement notation), and on most machines the result is
undefined or the operation is suppressed.
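

In C, both division by 0 and this overflow case are undefined behavior for signed operands, so a program that cannot rule them out must test for them before dividing. A minimal sketch of such a test follows; the name `div_overflows` is ours.

```c
#include <limits.h>
#include <stdbool.h>

/* True if n / d has no representable result: division by zero, or the
   maximum negative number divided by -1. */
bool div_overflows(int n, int d) {
    return d == 0 || (n == INT_MIN && d == -1);
}
```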


Signed integer (truncating) division is related to ordinary rational division by 


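Equation 1


$$n \div d = \begin{cases} \lfloor n/d \rfloor, & \text{if } d \ne 0,\ nd \ge 0, \\ \lceil n/d \rceil, & \text{if } d \ne 0,\ nd < 0. \end{cases}$$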


Unsigned integer division—that is, division in which both n and d are interpreted as unsigned integers— 
satisfies the upper portion of (1). 


In the discussion that follows, we make use of the following elementary properties of arithmetic, which we 
don't prove here. See [Knu1] and [GKP] for interesting discussions of the floor and ceiling functions. 


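Theorem D1. For x real, k an integer,


$$\begin{array}{ll}
\lfloor x \rfloor = -\lceil -x \rceil & \lceil x \rceil = -\lfloor -x \rfloor \\
x - 1 < \lfloor x \rfloor \le x & x \le \lceil x \rceil < x + 1 \\
\lfloor x \rfloor \le x < \lfloor x \rfloor + 1 & \lceil x \rceil - 1 < x \le \lceil x \rceil \\
x \ge k \Leftrightarrow \lfloor x \rfloor \ge k & x \le k \Leftrightarrow \lceil x \rceil \le k \\
x > k \Rightarrow \lfloor x \rfloor \ge k & x < k \Rightarrow \lceil x \rceil \le k \\
x \le k \Rightarrow \lfloor x \rfloor \le k \Leftrightarrow x < k + 1 & x \ge k \Rightarrow \lceil x \rceil \ge k \Leftrightarrow x > k - 1 \\
x < k \Leftrightarrow \lfloor x \rfloor < k & x > k \Leftrightarrow \lceil x \rceil > k
\end{array}$$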


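Theorem D2. For n, d integers, d > 0,


$$\left\lfloor \frac{n}{d} \right\rfloor = \left\lceil \frac{n - d + 1}{d} \right\rceil \quad\text{and}\quad \left\lceil \frac{n}{d} \right\rceil = \left\lfloor \frac{n + d - 1}{d} \right\rfloor.$$


If d < 0:


$$\left\lfloor \frac{n}{d} \right\rfloor = \left\lceil \frac{n - d - 1}{d} \right\rceil \quad\text{and}\quad \left\lceil \frac{n}{d} \right\rceil = \left\lfloor \frac{n + d + 1}{d} \right\rfloor.$$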
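

The second identity of Theorem D2 for d > 0 is the familiar way of getting a ceiling from a division that computes a floor. A sketch for unsigned operands follows; the name `ceil_div` is ours, and the sum is formed in 64 bits only so that n + d - 1 cannot wrap around.

```c
#include <assert.h>
#include <stdint.h>

/* ceil(n/d) = floor((n + d - 1)/d) for d > 0. */
uint32_t ceil_div(uint32_t n, uint32_t d) {
    assert(d != 0);
    return (uint32_t)(((uint64_t)n + d - 1) / d);
}
```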


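Theorem D3. For x real, d an integer > 0:


$$\lfloor \lfloor x \rfloor / d \rfloor = \lfloor x/d \rfloor \quad\text{and}\quad \lceil \lceil x \rceil / d \rceil = \lceil x/d \rceil.$$


Corollary. For a, b real, $b \ne 0$, d an integer > 0,


$$\left\lfloor \frac{\lfloor a/b \rfloor}{d} \right\rfloor = \left\lfloor \frac{a}{bd} \right\rfloor \quad\text{and}\quad \left\lceil \frac{\lceil a/b \rceil}{d} \right\rceil = \left\lceil \frac{a}{bd} \right\rceil.$$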


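Theorem D4. For n, d integers, $d \ne 0$, and x real,


$$\left\lfloor \frac{n}{d} + x \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \ \text{if } 0 \le x < \left| \frac{1}{d} \right|, \quad\text{and}\quad \left\lceil \frac{n}{d} + x \right\rceil = \left\lceil \frac{n}{d} \right\rceil \ \text{if } -\left| \frac{1}{d} \right| < x \le 0.$$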


In the theorems below, rem(n, d) denotes the remainder of n divided by d. For negative d, it is defined by rem 
(n, -d) = rem(n, d), as in truncating and modulus division. We do not use rem(n, d) with n < 0. Thus, for our 
use, the remainder is always nonnegative. 


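Theorem D5. For $n \ge 0$, $d \ne 0$,


$$\mathrm{rem}(2n, d) = \begin{cases} 2\,\mathrm{rem}(n, d) & \text{or} \\ 2\,\mathrm{rem}(n, d) - |d|, \end{cases} \quad\text{and}\quad \mathrm{rem}(2n + 1, d) = \begin{cases} 2\,\mathrm{rem}(n, d) + 1 & \text{or} \\ 2\,\mathrm{rem}(n, d) - |d| + 1 \end{cases}$$


(whichever value is greater than or equal to 0 and less than |d|).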


Theorem D6. For $n \ge 0$, $d \ne 0$,


$$\mathrm{rem}(2n, 2d) = 2\,\mathrm{rem}(n, d).$$


Theorems D5 and D6 are easily proved from the basic definition of remainder, that is, that for some integer q
it satisfies


$$n = qd + \mathrm{rem}(n, d) \quad\text{with}\quad 0 \le \mathrm{rem}(n, d) < |d|,$$


provided $n \ge 0$ and $d \ne 0$ (n and d can be non-integers, but we will use these theorems only for integers).


9-2 Multiword Division 


As in the case of multiword multiplication, multiword division may be done by, basically, the traditional
grade-school method. The details, however, are surprisingly complicated. Figure 9-1 is Knuth's Algorithm D
[Knu2, sec. 4.3.1], coded in C. The underlying form of division it uses is $32 \overset{u}{\div} 16 \Rightarrow 32$. (Actually, the
quotient of these underlying division operations is at most 17 bits long.)


Figure 9-1 Multiword integer division, unsigned.


   #include <stddef.h>          // NULL.
   #include <alloca.h>          // alloca (nonstandard; header varies by platform).

   int nlz(unsigned x);         // Number of leading zeros, supplied elsewhere.

   int divmnu(unsigned short q[], unsigned short r[],
        const unsigned short u[], const unsigned short v[],
        int m, int n) {

      const unsigned b = 65536; // Number base (16 bits).
      unsigned short *un, *vn;  // Normalized form of u, v.
      unsigned qhat;            // Estimated quotient digit.
      unsigned rhat;            // A remainder.
      unsigned p;               // Product of two digits.
      int s, i, j, t, k;

      if (m < n || n <= 0 || v[n-1] == 0)
         return 1;              // Return if invalid param.

      if (n == 1) {                        // Take care of
         k = 0;                            // the case of a
         for (j = m - 1; j >= 0; j--) {    // single-digit
            q[j] = (k*b + u[j])/v[0];      // divisor here.
            k = (k*b + u[j]) - q[j]*v[0];
         }
         if (r != NULL) r[0] = k;
         return 0;
      }

      // Normalize by shifting v left just enough so that
      // its high-order bit is on, and shift u left the
      // same amount. We may have to append a high-order
      // digit on the dividend; we do that unconditionally.

      s = nlz(v[n-1]) - 16;        // 0 <= s <= 15.
      vn = (unsigned short *)alloca(2*n);
      for (i = n - 1; i > 0; i--)
         vn[i] = (v[i] << s) | (v[i-1] >> (16-s));
      vn[0] = v[0] << s;

      un = (unsigned short *)alloca(2*(m + 1));
      un[m] = u[m-1] >> (16-s);
      for (i = m - 1; i > 0; i--)
         un[i] = (u[i] << s) | (u[i-1] >> (16-s));
      un[0] = u[0] << s;

      for (j = m - n; j >= 0; j--) {       // Main loop.
         // Compute estimate qhat of q[j].
         qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
         rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];
   again:
         if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2])
         { qhat = qhat - 1;
           rhat = rhat + vn[n-1];
           if (rhat < b) goto again;
         }

         // Multiply and subtract.
         k = 0;
         for (i = 0; i < n; i++) {
            p = qhat*vn[i];
            t = un[i+j] - k - (p & 0xFFFF);
            un[i+j] = t;
            k = (p >> 16) - (t >> 16);
         }
         t = un[j+n] - k;
         un[j+n] = t;

         q[j] = qhat;              // Store quotient digit.
         if (t < 0) {              // If we subtracted too
            q[j] = q[j] - 1;       // much, add back.
            k = 0;
            for (i = 0; i < n; i++) {
               t = un[i+j] + vn[i] + k;
               un[i+j] = t;
               k = t >> 16;
            }
            un[j+n] = un[j+n] + k;
         }
      } // End j.
      // If the caller wants the remainder, unnormalize
      // it and pass it back.
      if (r != NULL) {
         for (i = 0; i < n; i++)
            r[i] = (un[i] >> s) | (un[i+1] << (16-s));
      }
      return 0;
   }

The algorithm processes its inputs and outputs a halfword at a time. Of course, we would prefer to process a
fullword at a time, but it seems that such an algorithm would require an instruction that does
$64 \overset{u}{\div} 32 \Rightarrow 32$ division. We assume here that either the machine does not have that instruction or it is hard
to access from our high-level language. Although we generally assume the machine has
$32 \overset{u}{\div} 32 \Rightarrow 32$ division, for this problem $32 \overset{u}{\div} 16 \Rightarrow 16$ suffices.


Thus, for this implementation of Knuth's algorithm, the base b is 65536. See [Knu2] for most of the
explanation of this algorithm.


The dividend u and the divisor v are in "little-endian" order; that is, u[0] and v[0] are the least
significant digits. (The code works correctly on both big- and little-endian machines.) Parameters m and n are
the number of halfwords in u and v, respectively (Knuth defines m to be the length of the quotient). The caller
supplies space for the quotient q and, optionally, for the remainder r. The space for the quotient must be at
least m - n + 1 halfwords, and for the remainder, n halfwords. Alternatively, a value of NULL can be given
for the address of the remainder to signify that the remainder is not wanted.


The algorithm requires that the most significant digit of the divisor, v[n-1], be nonzero. This simplifies the
normalization steps and helps to ensure that the caller has allocated sufficient space for the quotient. The code
checks that v[n-1] is nonzero, and also the requirements that $n \ge 1$ and $m \ge n$. If any of these conditions
are violated, it returns with an error code (return value 1).


After these checks, the code performs the division for the simple case in which the divisor is of length 1. This 
case is not singled out for speed; the rest of the algorithm requires that the divisor be of length 2 or more. 
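

The single-digit case is ordinary short division carried out from the most significant halfword down, with the running remainder serving as the high-order part of each two-halfword dividend. The following stand-alone sketch shows just that case; the name `div_by_halfword` and its interface are ours and differ from Figure 9-1 in that the remainder is the return value.

```c
#include <assert.h>
#include <stdint.h>

/* Divides the m-halfword number u (little-endian) by the single halfword v.
   Stores m quotient halfwords in q and returns the remainder. */
uint16_t div_by_halfword(uint16_t q[], const uint16_t u[], int m, uint16_t v) {
    assert(m >= 1 && v != 0);
    uint32_t k = 0;                        /* running remainder, always < v */
    for (int j = m - 1; j >= 0; j--) {
        uint32_t cur = (k << 16) | u[j];   /* two-halfword partial dividend */
        q[j] = (uint16_t)(cur / v);        /* fits: cur < v * 65536         */
        k = cur % v;
    }
    return (uint16_t)k;
}
```

For example, dividing 0x12345678 (stored as the halfwords 0x5678, 0x1234) by 0x10 gives the quotient halfwords 0x4567, 0x0123 and remainder 8.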


If the divisor is of length 2 or more, the algorithm normalizes the divisor by shifting it left just enough so that 
its high-order bit is 1. The dividend is shifted left the same amount, so the quotient is not changed by these 
shifts. As explained by Knuth, these steps are necessary to make it easy to guess each quotient digit with good 
accuracy. The number of leading zeros function, nlz(x), is used to determine the shift amount. 
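

As a sketch of how the shift amount is obtained, here is a simple loop version of the number of leading zeros function together with the computation of s for a 16-bit digit. The loop is chosen for clarity rather than speed, and the name `norm_shift` is ours.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading zeros in a 32-bit word. */
int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Shift amount that moves the leading 1-bit of a nonzero halfword
   into bit position 15. */
int norm_shift(uint16_t top_digit) {
    assert(top_digit != 0);
    return nlz(top_digit) - 16;
}
```

Because the most significant divisor digit is nonzero, the shift amount computed this way lies between 0 and 15.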


In the normalization steps, new space is allocated for the normalized dividend and divisor. This is done because
it is generally undesirable, from the caller's point of view, to alter these input arguments, and because it may be
impossible to alter them: they may be constants in read-only memory. Furthermore, the dividend may need an
additional high-order digit. C's "alloca" function is ideal for allocating this space. It is usually implemented
very efficiently, requiring only two or three in-line instructions to allocate the space and no instructions at all to
free it. The space is allocated on the program's stack, in such a way that it is freed automatically upon
subroutine return.


In the main loop, the quotient digits are cranked out one per loop iteration, and the dividend is reduced until it
becomes the remainder. The estimate qhat of each quotient digit, after being refined by the steps in the loop
labelled again, is always either exact or too high by 1.


The next steps multiply qhat by the divisor and subtract the product from the current remainder, as in the
grade-school method. If the remainder is negative, it is necessary to decrease the quotient digit by 1 and either
re-multiply and subtract or, more simply, adjust the remainder by adding the divisor to it. This need be done at
most once, because the quotient digit was either exact or 1 too high.


Lastly, the remainder is given back to the caller if the address of where to put it is non-null. The remainder must
be shifted right by the normalization shift amount s.


The "add back" steps are executed only rarely. To see this, observe that the first calculation of each estimated 
quotient digit qhat is done by dividing the most significant two digits of the current remainder by the most
significant digit of the divisor. The steps in the "again" loop amount to refining qhat to be the result of
dividing the most significant three digits of the current remainder by the most significant two digits of the
divisor (proof omitted; convince yourself of this by trying some examples using b = 10). Note that the divisor
is greater than or equal to b/2 (because of normalization) and the dividend is less than or equal to b times the
divisor (because each remainder is less than the divisor).


How accurate is the quotient estimated by using only three dividend digits and two divisor digits? Because
normalization was done, it can be shown to be quite accurate. To see this somewhat intuitively (not a formal
proof), consider estimating u/v in this way for base ten arithmetic. It can be shown that the estimate is always
high (or exact). Thus, the worst case occurs if truncation of the divisor to two digits decreases the divisor by as
much as possible in the sense of relative error, and truncation of the dividend to three digits increases it by as
little as possible (which is 0), and if the dividend is as large as possible. This occurs for the case
49900...0/5099...9, which we estimate by 499/50 = 9.98. The true result is approximately 499/51 ≈ 9.7843. The
difference of 0.1957 reveals that the estimated quotient digit and the true quotient digit, which are the floor
functions of these ratios, will differ by at most 1, and this will occur about 20% of the time (assuming the
quotient digits are uniformly distributed). This in turn means that the "add back" steps will be executed about
20% of the time.


Carrying out this (non-rigorous) analysis for a general base b yields the result that the estimated and true
quotients differ by at most 2/b. For b = 65536, we again obtain the result that the difference between the
estimated and true quotient digits is at most 1, and this occurs with probability 2/65536 ≈ 0.00003. Thus the
"add back" steps are executed for only about 0.003% of the quotient digits.


An example that requires the add back step is, in decimal, 4500/501. A similar example for base 65536 is
0x7FFF800000000000/0x800000000001.
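

The decimal example can be checked with a small sketch of the estimation steps, written for an arbitrary small base so that b = 10 can be tried. The function `estimate_qhat` is ours; it mirrors the qhat/rhat computation and the "again" loop of Figure 9-1, taking the top three dividend digits (u2, u1, u0) and the top two divisor digits (v1, v0), and it assumes the base is small enough that the int products do not overflow.

```c
/* Refined estimate of one quotient digit in base b. */
int estimate_qhat(int b, int u2, int u1, int u0, int v1, int v0) {
    int qhat = (u2 * b + u1) / v1;
    int rhat = (u2 * b + u1) - qhat * v1;
    while (qhat >= b || qhat * v0 > b * rhat + u0) {
        qhat = qhat - 1;
        rhat = rhat + v1;
        if (rhat >= b) break;      /* corresponds to not taking "goto again" */
    }
    return qhat;
}
```

For 4500/501 the digits are u2 = 4, u1 = 5, u0 = 0, v1 = 5, v0 = 0. The estimate is 9, but the true quotient is 8, so the add-back step is needed.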


We will not attempt to estimate the running time of this entire program, but simply note that for large m and n,
the execution time is dominated by the multiply/subtract loop. On a good compiler this will compile into about
16 basic RISC instructions, one of which is multiply. The "for j" loop is executed m - n + 1 times, and the
multiply/subtract loop n times on each of those iterations, giving an execution time for this part of the program
of (15 + mul)n(m - n + 1) cycles, where mul is the time to multiply two 16-bit variables. The program also
executes m - n + 1 divide instructions and one number of leading zeros instruction.


Signed Multiword Division 


We do not give an algorithm specifically for signed multiword division, but merely point out that the unsigned 
algorithm can be adapted for this purpose as follows: 


1. Negate the dividend if it is negative, and similarly for the divisor.
2. Convert the dividend and divisor to unsigned representation.
3. Use the unsigned multiword division algorithm.
4. Convert the quotient and remainder to signed representation.
5. Negate the quotient if the dividend and divisor had opposite signs.
6. Negate the remainder if the dividend was negative.


These steps sometimes require adding or deleting a most significant digit. For example, assume for simplicity
that the numbers are represented in base 256 (one byte per digit), and that in the signed representation, the
high-order bit of the sequence of digits is the sign bit. This is much like ordinary two's-complement representation.
Then, a divisor of 255, which has signed representation 0x00FF, must be shortened in step 2 to 0xFF. Similarly,
if the quotient from step 3 begins with a 1-bit, it must be provided with a leading 0-byte for correct
representation as a signed quantity.
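

The six steps can be illustrated on single words, where the question of adding or deleting digits does not arise. The sketch below is only this single-word analogue, not a multiword routine; the names are ours, the magnitudes are formed in unsigned arithmetic so that negating the maximum negative number does not overflow, and the conversions back to signed assume the usual two's-complement behavior.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t quot; int32_t rem; } sdivresult;

sdivresult sdiv_via_udiv(int32_t n, int32_t d) {
    assert(d != 0);
    /* Steps 1 and 2: magnitudes in unsigned representation. */
    uint32_t un = (n < 0) ? (uint32_t)0 - (uint32_t)n : (uint32_t)n;
    uint32_t ud = (d < 0) ? (uint32_t)0 - (uint32_t)d : (uint32_t)d;
    /* Step 3: unsigned division. */
    uint32_t uq = un / ud;
    uint32_t ur = un % ud;
    /* Steps 4 to 6: negate as needed, then convert back to signed. */
    if ((n < 0) != (d < 0)) uq = (uint32_t)0 - uq;
    if (n < 0)              ur = (uint32_t)0 - ur;
    sdivresult r = { (int32_t)uq, (int32_t)ur };
    return r;
}
```

The results agree with truncating division: the remainder takes the sign of the dividend.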


9-3 Unsigned Short Division from Signed Division 


By "short division" we mean the division of one single word by another (e.g., $32 \div 32 \Rightarrow 32$). It is the form of
division provided by the "/" operator, when the operands are integers, in C and many other high-level
languages. C has both signed and unsigned short division, but some computers provide only signed division in
their instruction repertoire. How can you implement unsigned division on such a machine? There does not seem
to be any really slick way to do it, but we offer here some possibilities.


Using Signed Long Division 


Even if the machine has signed long division ($64 \div 32 \Rightarrow 32$), unsigned short division is not as simple as you
might think. In the XLC compiler for the IBM RS/6000, it is implemented as illustrated below for
$q \leftarrow (n \overset{u}{\div} d)$.


   if $n \overset{u}{<} d$ then $q \leftarrow 0$
   else if $d = 1$ then $q \leftarrow n$
   else if $d \le 1$ then $q \leftarrow 1$
   else $q \leftarrow (0 \,\|\, n) \div d$


The third line is really testing to see if $d \overset{u}{\ge} 2^{31}$. If d is algebraically less than or equal to 1 at this point, then
because it is not equal to 1 (from the second line), it must be algebraically less than or equal to 0. We don't care
about the case d = 0, so for the cases of interest, if the test on the third line evaluates to true, the sign bit of d is
on, that is, $d \overset{u}{\ge} 2^{31}$. Because from the first line it is known that $n \overset{u}{\ge} d$, and because n cannot exceed
$2^{32} - 1$, $n \overset{u}{\div} d = 1$.


The notation on the fourth line means to form the double-length integer consisting of 32 0-bits followed by the
32-bit quantity n, and divide it by d. The test for d = 1 (second line) is necessary to ensure that this division
does not overflow (it would overflow if $n \overset{u}{\ge} 2^{31}$, and then the quotient would be undefined).
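

The four-line method can be modeled in C, with a 64-bit signed division standing in for the machine's signed long division instruction. This is a sketch under that assumption; the function name is ours, and the reinterpretation of d as a signed quantity assumes two's-complement conversion. As in the text, the case d = 0 is not handled specially.

```c
#include <stdint.h>

uint32_t udiv_via_signed_long(uint32_t n, uint32_t d) {
    if (n < d) return 0;                  /* unsigned comparison         */
    if (d == 1) return n;                 /* avoids quotient overflow    */
    if ((int32_t)d <= 1) return 1;        /* signed test: d >= 2**31     */
    /* (0 || n) divided by d, as a signed 64-bit division. */
    return (uint32_t)((int64_t)n / (int64_t)d);
}
```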


By commoning the comparisons on the second and third lines, [3] the above can be implemented in 11
instructions, three of which are branches. If it is necessary that the divide be executed when d = 0, to get the
overflow interrupt, then the third line can be changed to "else if d < 0 then $q \leftarrow 1$," giving a 12-instruction
solution on the RS/6000.


[3] One execution of the RS/6000's compare instruction sets multiple status bits indicating less than, greater than, or
equal.


It is a simple matter to alter the above code so that the probable usual cases ($2 \le d < 2^{31}$) do not go through
so many tests (begin with $d \le 1$ ...), but the code volume increases slightly.


Using Signed Short Division 


If signed long division is not available, but signed short division is, then $n \overset{u}{\div} d$ can be implemented by
somehow reducing the problem to the case $n, d < 2^{31}$, and using the machine's divide instruction. If
$d \overset{u}{\ge} 2^{31}$, then $n \overset{u}{\div} d$ can only be 0 or 1, so this case is easily dispensed with. Then, we can reduce the dividend
by using the fact that the expression $((n \overset{u}{\div} 2) \div d) \times 2$ approximates $n \overset{u}{\div} d$ with an error of only 0 or 1. This
leads to the following method:


   1. if $d < 0$ then if $n \overset{u}{<} d$ then $q \leftarrow 0$
   2.                 else $q \leftarrow 1$
   3. else do
   4.    $q \leftarrow ((n \overset{u}{\div} 2) \div d) \times 2$
   5.    $r \leftarrow n - qd$
   6.    if $r \overset{u}{\ge} d$ then $q \leftarrow q + 1$
   7. end


The test d < 0 on line 1 is really testing to determine if $d \overset{u}{\ge} 2^{31}$. If $d \overset{u}{\ge} 2^{31}$, then the largest the quotient could
be is $(2^{32} - 1) \div 2^{31} = 1$, so the first two lines compute the correct quotient.


Line 4 represents the code shift right unsigned 1, divide, shift left 1. Clearly, $n \overset{u}{\div} 2 < 2^{31}$, and at this point
$d \overset{u}{<} 2^{31}$ as well, so these quantities can be used in the computer's signed division instruction. (If d = 0,
overflow will be signaled here.)


The estimate computed at line 4 is


$$q = \lfloor \lfloor n/2 \rfloor / d \rfloor \cdot 2 = \lfloor n/(2d) \rfloor \cdot 2 = \frac{n - \mathrm{rem}(n, 2d)}{d},$$


where we have used the corollary of Theorem D3. Line 5 computes the remainder corresponding to the
estimated quotient. It is


$$r = n - \frac{n - \mathrm{rem}(n, 2d)}{d}\, d = \mathrm{rem}(n, 2d).$$


Thus, $0 \le r < 2d$. If r < d, then q is the correct quotient. If $r \ge d$, then adding 1 to q gives the correct quotient
(the program must use an unsigned comparison here because of the possibility that $r \ge 2^{31}$).


By moving the load immediate of 0 into q ahead of the comparison $n \overset{u}{<} d$, and coding the assignment $q \leftarrow 1$ in
line 2 as a branch to the assignment $q \leftarrow q + 1$ in line 6, this can be coded in 14 instructions on most machines,
four of which are branches. It is straightforward to augment the code to produce the remainder as well: to line 1
append $r \leftarrow n$, to line 2 append $r \leftarrow n - d$, and to the "then" clause in line 6 append $r \leftarrow r - d$. (Or, at the cost of
a multiply, simply append $r \leftarrow n - qd$ to the end of the whole sequence.)
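

A C model of the seven-line method follows, with C's signed `/` on 32-bit operands standing in for the machine's signed short division instruction. It is a sketch: the function name is ours, the casts to `int32_t` assume two's-complement conversion, and a zero divisor is excluded by an assertion rather than left to signal overflow.

```c
#include <assert.h>
#include <stdint.h>

uint32_t udiv_via_signed_short(uint32_t n, uint32_t d) {
    assert(d != 0);
    if ((int32_t)d < 0)                    /* line 1: d >= 2**31          */
        return (n < d) ? 0 : 1;            /* lines 1 and 2               */
    /* Line 4: shift right unsigned 1, signed divide, shift left 1. */
    uint32_t q = (uint32_t)((int32_t)(n >> 1) / (int32_t)d) << 1;
    uint32_t r = n - q * d;                /* line 5                      */
    if (r >= d) q = q + 1;                 /* line 6: unsigned comparison */
    return q;
}
```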


An alternative for lines 1 and 2 is


   if $n \overset{u}{<} d$ then $q \leftarrow 0$
   else if $d < 0$ then $q \leftarrow 1$,


which can be coded a little more compactly, for a total of 13 instructions, three of which are branches. But it
executes more instructions in what is probably the usual case (small numbers with n > d).


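Using predicate expressions, the program can be written


   1. if $d < 0$ then $q \leftarrow (n \overset{u}{\ge} d)$
   2. else do
   3.    $q \leftarrow ((n \overset{u}{\div} 2) \div d) \times 2$
   4.    $r \leftarrow n - qd$
   5.    $q \leftarrow q + (r \overset{u}{\ge} d)$
   6. end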


which saves two branches if there is a way to evaluate the predicates without branching. On the Compaq Alpha
they can be evaluated in one instruction (CMPULE); on MIPS they take two (SLTU, XORI). On most
computers, they can be evaluated in four instructions each (three if equipped with a full set of logic
instructions), by using the expression for $x \overset{u}{\le} y$ given in "Comparison Predicates" on page 21, and simplifying
because on line 1 of the program above it is known that $d_{31} = 1$, and on line 5 it is known that $d_{31} = 0$. The
expression simplifies to


$$n \overset{u}{\ge} d = (n \mathbin{\&} \neg(n - d)) \overset{u}{\gg} 31 \quad \text{on line 1, and}$$


$$r \overset{u}{\ge} d = (r \mid \neg(r - d)) \overset{u}{\gg} 31 \quad \text{on line 5.}$$
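

The two simplified predicates translate directly into C. In this sketch the function names are ours; each function is valid only under the stated assumption about the sign bit of d.

```c
#include <stdint.h>

/* n >= d (unsigned), valid when bit 31 of d is 1. */
uint32_t geu_when_d_high(uint32_t n, uint32_t d) {
    return (n & ~(n - d)) >> 31;
}

/* r >= d (unsigned), valid when bit 31 of d is 0. */
uint32_t geu_when_d_low(uint32_t r, uint32_t d) {
    return (r | ~(r - d)) >> 31;
}
```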


We can get branch-free code by forcing the dividend to be 0 when $d \overset{u}{\ge} 2^{31}$. Then, the divisor can be used in
the machine's signed divide instruction, because when it is misinterpreted as a negative number, the result is set
to 0, which is within 1 of being correct. We'll still handle the case of a large dividend by shifting it one position
to the right before the division, and then shifting the quotient one position to the left after the division. This
gives the following program (ten basic RISC instructions):


   1. $t \leftarrow d \overset{s}{\gg} 31$
   2. $n' \leftarrow n \mathbin{\&} \neg t$
   3. $q \leftarrow ((n' \overset{u}{\div} 2) \div d) \times 2$
   4. $r \leftarrow n - qd$
   5. $q \leftarrow q + (r \overset{u}{\ge} d)$
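

The five steps can be modeled in C as follows. This is a sketch, not a claim about what any compiler emits: the function name is ours, the mask t is formed as 0 - (d >> 31) rather than with a signed right shift (the two give the same value, and the former avoids implementation-defined behavior in C), the casts to `int32_t` assume two's-complement conversion, and the comparison on the last line relies on C's unsigned `>=` yielding 0 or 1.

```c
#include <assert.h>
#include <stdint.h>

uint32_t udiv_branch_free(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t t  = (uint32_t)0 - (d >> 31);    /* all 1's if d >= 2**31    */
    uint32_t n1 = n & ~t;                     /* dividend forced to 0     */
    uint32_t q  = (uint32_t)((int32_t)(n1 >> 1) / (int32_t)d) << 1;
    uint32_t r  = n - q * d;
    return q + (r >= d);                      /* unsigned comparison      */
}
```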


9-4 Unsigned Long Division 


By "long division" we mean the division of a doubleword by a single word. For a 32-bit machine, this is
$64 \overset{u}{\div} 32 \Rightarrow 32$ division, with the result unspecified in the overflow cases, including division by 0.


Some 32-bit machines provide an instruction for unsigned long division. Its full capability, however, gets little
use, because only $32 \overset{u}{\div} 32 \Rightarrow 32$ division is accessible with most high-level languages. Therefore, a computer
designer might elect to provide only $32 \overset{u}{\div} 32$ division, and would probably want an estimate of the
execution time of a subroutine that implements the missing function. Here we give two algorithms for
providing this missing function.


Hardware Shift-and-Subtract Algorithms 


As a first attempt at doing long division, we consider doing what the hardware does. There are two algorithms 
commonly used, called restoring and nonrestoring division [H&P, sec. A-2; EL]. They are both basically "shift- 
and-subtract" algorithms. In the restoring version, shown below, the restoring step consists of adding back the 
divisor when the subtraction gives a negative result. Here x, y, and z are held in 32-bit registers. Initially, the 
double-length dividend is x || y, and the divisor is z. We need a single-bit register c to hold the overflow from 
the subtraction. 


   do i ← 1 to 32
      c || x || y ← 2(x || y)                        // Shift left one.
      c || x ← (c || x) − (0b0 || z)                 // Subtract (33 bits).
      y0 ← ¬c                                        // Set one bit of quotient.
      if c then c || x ← (c || x) + (0b0 || z)       // Restore.
   end


Upon completion, the quotient is in register y and the remainder is in register x. 


The algorithm does not give a useful result in the overflow cases. For division of the doubleword quantity x || y
by 0, the quotient obtained is the one's-complement of x, and the remainder obtained is y. In particular,
$0 \overset{u}{\div} 0 \Rightarrow 2^{32} - 1$ rem 0. The other overflow cases are difficult to characterize.


It might be useful if, for nonzero divisors, the algorithm would give the correct quotient modulo $2^{32}$, and the
correct remainder. However, the only way to do this seems to be to make the register represented by c || x || y
above 97 bits long, and do the loop 64 times. This is doing $64 \overset{u}{\div} 32 \Rightarrow 64$ division. The subtractions would
still be 33-bit operations, but the additional hardware and execution time make this refinement probably not
worthwhile.


This algorithm is difficult to implement exactly in software, because most machines do not have the 33-bit 
register that we have represented by c || x. Figure 9-2, however, illustrates a shift-and-subtract algorithm that 
reflects the hardware algorithm to some extent. 


The variable t is used for a device to make the comparison come out right. We want to do a 33-bit comparison
after shifting x || y. If the first bit of x is 1 (before the shift), then certainly the 33-bit quantity is greater than
the divisor (32 bits). In this case, x | t is all 1's, so the comparison gives the correct result (true). On the other
hand, if the first bit of x is 0, then a 32-bit comparison is sufficient.


The code of the algorithm in Figure 9-2 executes in 321 to 385 basic RISC instructions, depending upon how 
often the comparison is true. If the machine has shift left double, the shifting operation can be done in one 
instruction, rather than the four used above. This would reduce the execution time to about 225 to 289 
instructions (we are allowing two instructions per iteration for loop control). 


Figure 9-2 Divide long unsigned, shift-and-subtract algorithm.


   unsigned divlu(unsigned x, unsigned y, unsigned z) {
      // Divides (x || y) by z.
      int i;
      unsigned t;

      for (i = 1; i <= 32; i++) {
         t = (int)x >> 31;          // All 1's if x(31) = 1.
         x = (x << 1) | (y >> 31);  // Shift x || y left
         y = y << 1;                // one bit.
         if ((x | t) >= z) {
            x = x - z;
            y = y + 1;
         }
      }
      return y;                     // Remainder is x.
   }


The algorithm in Figure 9-2 can be used to do $32 \overset{u}{\div} 32 \Rightarrow 32$ division by supplying x = 0. The only
simplification that results is that the variable t can be omitted, as its value would always be 0.
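

Written out, the simplified routine looks like the sketch below. The names are ours, fixed-width types are used in place of `unsigned`, and both quotient and remainder are returned in a small structure. The variable t can be dropped because after i iterations x is less than 2**i, so the high-order bit of x is never 1 before a shift.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t quot; uint32_t rem; } udivresult;

/* Shift-and-subtract division of y by z, both 32 bits. */
udivresult divu_shift_subtract(uint32_t y, uint32_t z) {
    assert(z != 0);
    uint32_t x = 0;                     /* high-order half of the dividend */
    for (int i = 1; i <= 32; i++) {
        x = (x << 1) | (y >> 31);       /* shift x || y left one bit       */
        y = y << 1;
        if (x >= z) {
            x = x - z;
            y = y + 1;                  /* set one bit of the quotient     */
        }
    }
    udivresult r = { y, x };
    return r;
}
```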


Below is the nonrestoring hardware division algorithm (unsigned). The basic idea is that, after subtracting the
divisor z from the 33-bit quantity that we denote by c || x, there is no need to add back z if the result was
negative. Instead, it suffices to add on the next iteration, rather than subtract. This is because adding z (to
correct the error of having subtracted z on the previous iteration), shifting left, and subtracting z is equivalent to
adding z (2(u + z) - z = 2u + z). The advantage to hardware is that there is only one add or subtract operation on
each loop iteration, and the adder is likely to be the slowest circuit in the loop. [4] An adjustment to the
remainder is needed at the end, if it is negative. (No corresponding adjustment of the quotient is required.)


[4] Actually, the restoring division algorithm can avoid the restoring step by putting the result of the subtraction in an 
additional register, and writing that register into x only if the result of the subtraction (33 bits) is nonnegative. But in 
some implementations this may require an additional register and possibly more time. 


The input dividend is the doubleword quantity x || y, and the divisor is z. Upon completion, the quotient is in 
register y and the remainder is in register x. 


c = 0
do i = 1 to 32
   if c = 0 then do
      c || x || y ← 2(x || y)            // Shift left one.
      c || x ← (c || x) - (0b0 || z)     // Subtract divisor.
   end
   else do
      c || x || y ← 2(x || y)            // Shift left one.
      c || x ← (c || x) + (0b0 || z)     // Add divisor.
   end
   y(0) ← ¬c                             // Set one bit of quotient.
end
if c = 1 then x ← x + z                  // Adjust remainder if negative.


This does not seem to adapt very well to a 32-bit algorithm. 
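
For experimentation, the loop above can be modeled in C. The following is only a sketch under stated assumptions: the 33-bit quantity c || x is held as a signed 64-bit value u (so c = 1 corresponds to u < 0), the caller guarantees x < z so that the quotient fits in 32 bits, and the name divlu_nonrestoring is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of unsigned nonrestoring division of (x || y) by z.
   u models c || x as a signed value; u < 0 plays the role of c = 1. */
uint32_t divlu_nonrestoring(uint32_t x, uint32_t y, uint32_t z, uint32_t *rem) {
    int64_t u = x;
    assert(x < z);                        /* Otherwise the quotient overflows. */
    for (int i = 0; i < 32; i++) {
        int was_negative = (u < 0);
        u = 2 * u + (y >> 31);            /* Shift c || x || y left one. */
        y = y << 1;
        if (was_negative) u += z;         /* Add divisor. */
        else              u -= z;         /* Subtract divisor. */
        if (u >= 0) y |= 1;               /* Set one bit of quotient (not c). */
    }
    if (u < 0) u += z;                    /* Adjust remainder if negative. */
    if (rem != NULL) *rem = (uint32_t)u;
    return y;
}
```

Each iteration performs exactly one add or subtract, which is the property the hardware algorithm is after.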


The 801 minicomputer (an early experimental RISC machine built by IBM) had a divide step instruction that 
essentially performed the steps in the body of the loop above. It used the machine's carry status bit to hold c, 
and the MQ (a 32-bit register) to hold y. A 33-bit adder/subtracter is needed for its implementation. The 801's 
divide step instruction was a little more complicated than the loop above, because it performed signed division 
and it had an overflow check. Using it, a division subroutine can be written that consists essentially of 32 
consecutive divide step instructions followed by some adjustments to the quotient and remainder to make the 
remainder have the desired sign. 


Using Short Division 


An algorithm for 64 ÷ 32 ⇒ 32 division can be obtained from the multiword division algorithm of Figure 9-1
on page 141, by specializing it to the case m = 4, n = 2. Several other changes are necessary. The parameters
should be fullwords passed by value, rather than arrays of halfwords. The overflow condition is different; it
occurs if the quotient cannot be contained in a single fullword. It turns out that many simplifications to the
routine are possible. It can be shown that the guess qhat is always exact; it is exact if the divisor consists of
only two halfword digits. This means that the "add back" steps can be omitted. If the "main loop" of Figure 9-1
and the loop within it are unrolled, some minor simplifications become possible.


The result of these transformations is shown in Figure 9-3. The dividend is in u1 and u0, with u1 containing
the most significant word. The divisor is parameter v. The quotient is the returned value of the function. If the
caller provides a non-null pointer in parameter r, the function will return the remainder in the word to which r 
points. 


For an overflow indication, the program returns a remainder equal to the maximum unsigned integer. This is an 
impossible remainder for a valid division operation, because the remainder must be less than the divisor. In the 
overflow case, the program also returns a quotient equal to the maximum unsigned integer, which may be an 
adequate indicator in some cases in which the remainder is not wanted. 
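
The calling convention just described can be captured in a small reference model, useful for testing an implementation such as Figure 9-3. This is a sketch only: it assumes the compiler provides 64-bit unsigned arithmetic, and the name divlu_ref is hypothetical rather than part of the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model of the divlu interface, using 64-bit arithmetic.
   Overflow (u1 >= v, which includes v == 0) yields quotient and
   remainder both equal to the maximum unsigned integer. */
uint32_t divlu_ref(uint32_t u1, uint32_t u0, uint32_t v, uint32_t *r) {
    if (u1 >= v) {
        if (r != NULL) *r = 0xFFFFFFFFu;  /* Impossible remainder. */
        return 0xFFFFFFFFu;               /* Largest possible quotient. */
    }
    uint64_t u = ((uint64_t)u1 << 32) | u0;
    uint64_t q = u / v;
    assert(q <= 0xFFFFFFFFu);             /* Guaranteed because u1 < v. */
    if (r != NULL) *r = (uint32_t)(u % v);
    return (uint32_t)q;
}
```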


The strange expression (-s >> 31) in the assignment to un32 is supplied to make the program work for the
case s = 0 on machines that have mod 32 shifts (e.g., Intel x86).
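
The effect of that mask can be seen in isolation. In the sketch below the shift count is reduced mod 32 explicitly, to mimic such a machine without relying on undefined behavior in C; the helper name high_word_shifted is hypothetical, and an arithmetic right shift of negative int values is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: high word of (u1 || u0) << s, for 0 <= s <= 31.
   With a mod-32 shifter, u0 >> (32 - s) degenerates to u0 >> 0 when s = 0.
   The mask -s >> 31 is all 1's for s > 0 and 0 for s = 0, so it discards
   those unwanted bits. Assumes >> on a negative int is arithmetic. */
uint32_t high_word_shifted(uint32_t u1, uint32_t u0, int s) {
    assert(0 <= s && s <= 31);
    uint32_t mask = (uint32_t)(-s >> 31);
    return (u1 << s) | ((u0 >> ((32 - s) & 31)) & mask);
}
```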


Experimentation with uniformly distributed random numbers suggests that the bodies of the "again" loops are
each executed about 0.38 times for each execution of the function. This gives an execution time, if the
remainder is not wanted, of about 52 instructions. Of these instructions, one is number of leading zeros, two are
divide, and 6.5 are multiply (not counting the multiplications by b, which are shifts). If the remainder is
wanted, add six instructions (counting the store of r), one of which is multiply.


What about a signed version of divlu? It would probably be difficult to modify the code of Figure 9-3, step
by step, to produce a signed variant. That algorithm, however, may be used for signed division by taking the
absolute value of the arguments, running divlu, and then complementing the result if the signs of the original
arguments differ. There is no problem with extreme values such as the maximum negative number, because the
absolute value of any signed integer has a correct representation as an unsigned integer. This algorithm is
shown in Figure 9-4.


Figure 9-3 Divide long unsigned, using fullword division instruction. 


unsigned divlu(unsigned u1, unsigned u0, unsigned v,
               unsigned *r) {
   const unsigned b = 65536; // Number base (16 bits).
   unsigned un1, un0,        // Norm. dividend LSD's.
            vn1, vn0,        // Norm. divisor digits.
            q1, q0,          // Quotient digits.
            un32, un21, un10,// Dividend digit pairs.
            rhat;            // A remainder.
   int s;                    // Shift amount for norm.

   if (u1 >= v) {            // If overflow, set rem.
      if (r != NULL)         // to an impossible value,
         *r = 0xFFFFFFFF;    // and return the largest
      return 0xFFFFFFFF;}    // possible quotient.

   s = nlz(v);               // 0 <= s <= 31.
   v = v << s;               // Normalize divisor.
   vn1 = v >> 16;            // Break divisor up into
   vn0 = v & 0xFFFF;         // two 16-bit digits.

   un32 = (u1 << s) | (u0 >> 32 - s) & (-s >> 31);
   un10 = u0 << s;           // Shift dividend left.

   un1 = un10 >> 16;         // Break right half of
   un0 = un10 & 0xFFFF;      // dividend into two digits.

   q1 = un32/vn1;            // Compute the first
   rhat = un32 - q1*vn1;     // quotient digit, q1.
again1:
   if (q1 >= b || q1*vn0 > b*rhat + un1) {
     q1 = q1 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again1;}

   un21 = un32*b + un1 - q1*v;  // Multiply and subtract.

   q0 = un21/vn1;            // Compute the second
   rhat = un21 - q0*vn1;     // quotient digit, q0.
again2:
   if (q0 >= b || q0*vn0 > b*rhat + un0) {
     q0 = q0 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again2;}

   if (r != NULL)            // If remainder is wanted,
      *r = (un21*b + un0 - q0*v) >> s;     // return it.
   return q1*b + q0;
}


Figure 9-4 Divide long signed, using divide long unsigned. 


int divls(int u1, unsigned u0, int v, int *r) {
   int q, uneg, vneg, diff, borrow;

   uneg = u1 >> 31;          // -1 if u < 0.
   if (uneg) {               // Compute the absolute
      u0 = -u0;              // value of the dividend u.
      borrow = (u0 != 0);
      u1 = -u1 - borrow;}

   vneg = v >> 31;           // -1 if v < 0.
   v = (v ^ vneg) - vneg;    // Absolute value of v.

   if ((unsigned)u1 >= (unsigned)v) goto overflow;

   q = divlu(u1, u0, v, (unsigned *)r);

   diff = uneg ^ vneg;       // Negate q if signs of
   q = (q ^ diff) - diff;    // u and v differed.
   if (uneg && r != NULL)
      *r = -*r;

   if ((diff ^ q) < 0 && q != 0) {  // If overflow,
overflow:                    // set remainder
      if (r != NULL)         // to an impossible value,
         *r = 0x80000000;    // and return the largest
      q = 0x80000000;}       // possible neg. quotient.
   return q;
}


It is hard to devise really good code to detect overflow in the signed case. The algorithm shown in Figure 9-4
makes a preliminary determination identical to that used by the unsigned long division routine, which ensures
that $|u/v| < 2^{32}$. After that, it is necessary only to ensure that the quotient has the proper sign or is 0.


Chapter 10. Integer Division by Constants 


On many computers, division is very time consuming and is to be avoided when possible. A value of 20 or
more elementary add times is not uncommon, and the execution time is usually the same large value even when
the operands are small. This chapter gives some methods for avoiding the divide instruction when the divisor is
a constant.


10-1 Signed Division by a Known Power of 2 


Apparently, many people have made the mistake of assuming that a shift right signed of k positions divides a
number by $2^k$, using the usual truncating form of division [GLS2]. It's a little more complicated than that. The
code shown below computes $q = n \div 2^k$, for $1 \le k \le 31$ [Hop].


shrsi t,n,k-1     Form the integer
shri  t,t,32-k    2**k - 1 if n < 0, else 0.
add   t,n,t       Add it to n,
shrsi q,t,k       and shift right (signed).


It is branch-free. It also simplifies to three instructions in the common case of division by 2 (k = 1). It does,
however, rely on the machine's being able to shift by a large amount in a short time. The case k = 31 does not
make too much sense, because the number $2^{31}$ is not representable in the machine. Nevertheless, the code does
produce the correct result in that case (which is q = -1 if $n = -2^{31}$ and q = 0 for all other n).
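
The four-instruction sequence maps directly onto C. The sketch below is illustrative only: the name div_pow2 is hypothetical, and it assumes that >> applied to a negative int32_t is an arithmetic (sign-propagating) shift, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: truncating signed division of n by 2**k, for 1 <= k <= 31.
   Assumes >> on a negative int32_t is an arithmetic shift. */
int32_t div_pow2(int32_t n, int k) {
    assert(1 <= k && k <= 31);
    uint32_t t = (uint32_t)(n >> (k - 1));  /* shrsi t,n,k-1            */
    t = t >> (32 - k);                      /* 2**k - 1 if n < 0, else 0 */
    return (n + (int32_t)t) >> k;           /* add, then shrsi by k      */
}
```

The addition cannot overflow, because t is nonzero only when n is negative.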


To divide by $-2^k$, the above code may be followed by a negate instruction. There does not seem to be any better
way to do it.


The more straightforward code for dividing by $2^k$ is


       bge   n,label       Branch if n >= 0.
       addi  n,n,2**k-1    Add 2**k - 1 to n,
label  shrsi n,n,k         and shift right (signed).


This would be preferable on a machine with slow shifts and fast branches. 


PowerPC has an unusual device for speeding up division by a power of 2 [GGS]. The shift right signed
instructions set the machine's carry bit if the number being shifted is negative and one or more 1-bits are shifted
out. That machine also has an instruction for adding the carry bit to a register, denoted addze. This allows
division by any (positive) power of 2 to be done in two instructions:


shrsi q,n,k 
addze q,q 


A single shrsi of k positions does a kind of signed division by $2^k$ that coincides with both modulus and floor
division. This suggests that one of these might be preferable to truncating division for computers and HLL's to
use. That is, modulus and floor division mesh with shrsi better than does truncating division, permitting a
compiler to translate the expression n/2 to an shrsi. Furthermore, shrsi followed by neg (negate) does
modulus division by $-2^k$, which is a hint that maybe modulus division is best. (However, this is mainly an
aesthetic issue. It is of little practical significance because division by a negative constant is no doubt extremely
rare.)


10-2 Signed Remainder from Division by a Known Power of 2 


If both the quotient and remainder of $n \div 2^k$ are wanted, it is simplest to compute the remainder r from
$r = n - q \cdot 2^k$. This requires only two instructions after computing the quotient q:


shli r,q,k
sub  r,n,r


To compute only the remainder seems to require about four or five instructions. One way to compute it is to use
the four-instruction sequence above for signed division by $2^k$, followed by the two instructions shown
immediately above to obtain the remainder. This results in two consecutive shift instructions that can be
replaced by an and, giving a solution in five instructions (four if k = 1):


shrsi t,n,k-1     Form the integer
shri  t,t,32-k    2**k - 1 if n < 0, else 0.
add   t,n,t       Add it to n,
andi  t,t,-2**k   clear rightmost k bits,
sub   r,n,t       and subtract it from n.


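Another method is based on

$$\mathrm{rem}(n, 2^k) = \begin{cases} n \mathbin{\&} (2^k - 1), & n \ge 0, \\ -((-n) \mathbin{\&} (2^k - 1)), & n < 0. \end{cases}$$

To use this, first compute $t \leftarrow n \stackrel{s}{\gg} 31$, and then

$$r \leftarrow ((\mathrm{abs}(n) \mathbin{\&} (2^k - 1)) \oplus t) - t$$

(five instructions) or, for k = 1, since (-n) & 1 = n & 1,

$$r \leftarrow ((n \mathbin{\&} 1) \oplus t) - t$$

(four instructions). This method is not very good for k > 1 if the machine does not have absolute value
(computing the remainder would then require seven instructions).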


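Still another method is based on

$$\mathrm{rem}(n, 2^k) = \begin{cases} n \mathbin{\&} (2^k - 1), & n \ge 0, \\ ((n + 2^k - 1) \mathbin{\&} (2^k - 1)) - (2^k - 1), & n < 0. \end{cases}$$

This leads to

$$t \leftarrow (n \stackrel{s}{\gg} (k - 1)) \stackrel{u}{\gg} (32 - k)$$

$$r \leftarrow ((n + t) \mathbin{\&} (2^k - 1)) - t$$

(five instructions for k > 1, four for k = 1).

The above methods all work for $1 \le k \le 31$.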
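
The three remainder methods of this section can be compared side by side in C. This is a sketch rather than production code: the function names are hypothetical, >> on a negative int32_t is assumed to be an arithmetic shift, and conversions between uint32_t and int32_t are assumed to wrap in two's-complement fashion.

```c
#include <assert.h>
#include <stdint.h>

/* 2**k - 1 if n < 0, else 0 (the shrsi/shri pair). */
static uint32_t bias(int32_t n, int k) {
    return (uint32_t)(n >> (k - 1)) >> (32 - k);
}

/* Method 1: add the bias, clear the rightmost k bits, subtract from n. */
int32_t rem_pow2_and(int32_t n, int k) {
    assert(1 <= k && k <= 31);
    uint32_t un = (uint32_t)n;
    uint32_t t = (un + bias(n, k)) & ~((1u << k) - 1);
    return (int32_t)(un - t);
}

/* Method 2: mask abs(n), then restore the sign with xor and subtract. */
int32_t rem_pow2_abs(int32_t n, int k) {
    assert(1 <= k && k <= 31);
    uint32_t t = (uint32_t)(n >> 31);        /* All 1's if n < 0. */
    uint32_t a = ((uint32_t)n ^ t) - t;      /* abs(n) as an unsigned value. */
    return (int32_t)(((a & ((1u << k) - 1)) ^ t) - t);
}

/* Method 3: add the bias, mask, and subtract the bias again. */
int32_t rem_pow2_add(int32_t n, int k) {
    assert(1 <= k && k <= 31);
    uint32_t t = bias(n, k);
    return (int32_t)((((uint32_t)n + t) & ((1u << k) - 1)) - t);
}
```

All three agree with C's truncating % operator for divisors that are representable, and they return 0 for n = -2**31 with k = 31.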


Incidentally, if shift right signed is not available, the value that is $2^k - 1$ for n < 0 and 0 for $n \ge 0$ can be
constructed from

$$t_1 \leftarrow n \stackrel{u}{\gg} 31$$

$$t \leftarrow (t_1 \ll k) - t_1$$

which adds only one instruction.
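
In C this construction needs only unsigned operations, so it is fully portable. The helper name bias_no_sar below is an illustrative choice.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: 2**k - 1 if n < 0, else 0, using only unsigned shifts. */
uint32_t bias_no_sar(int32_t n, int k) {
    assert(1 <= k && k <= 31);
    uint32_t t1 = (uint32_t)n >> 31;   /* 1 if n < 0, else 0. */
    return (t1 << k) - t1;             /* 2**k - 1 or 0.      */
}
```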


10-3 Signed Division and Remainder by Non-Powers of 2 


The basic trick is to multiply by a sort of reciprocal of the divisor d, approximately $2^{32}/d$, and then to extract
the leftmost 32 bits of the product. The details, however, are more complicated, particularly for certain divisors 
such as 7. 


Let us first consider a few specific examples. These illustrate the code that will be generated by the general 
method. We denote registers as follows: 


n - the input integer (numerator)
M - loaded with a "magic number"
t - a temporary register
q - will contain the quotient
r - will contain the remainder


Divide by 3


li    M,0x55555556   Load magic number, (2**32+2)/3.
mulhs q,M,n          q = floor(M*n/2**32).
shri  t,n,31         Add 1 to q if
add   q,q,t          n is negative.
muli  t,q,3          Compute remainder from
sub   r,n,t          r = n - q*3.


Proof. The multiply high signed operation (mulhs) cannot overflow, as the product of two 32-bit integers can
always be represented in 64 bits and mulhs gives the high-order 32 bits of the 64-bit product. This is
equivalent to dividing the 64-bit product by $2^{32}$ and taking the floor of the result, and this is true whether the
product is positive or negative. Thus, for $n \ge 0$ the above code computes

$$q = \left\lfloor \frac{2^{32} + 2}{3} \cdot \frac{n}{2^{32}} \right\rfloor = \left\lfloor \frac{n}{3} + \frac{2n}{3 \cdot 2^{32}} \right\rfloor.$$

Now, $n < 2^{31}$, because $2^{31} - 1$ is the largest representable positive number. Hence the "error" term
$2n/(3 \cdot 2^{32})$ is less than 1/3 (and is nonnegative), so by Theorem D4 (page 139) we have
$q = \lfloor n/3 \rfloor$, which is the desired result (Equation (1) on page 138).


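For n < 0, there is an addition of 1 to the quotient. Hence the code computes

$$q = \left\lfloor \frac{2^{32} + 2}{3} \cdot \frac{n}{2^{32}} \right\rfloor + 1 = \left\lfloor \frac{2^{32}n + 2n + 3 \cdot 2^{32}}{3 \cdot 2^{32}} \right\rfloor = \left\lceil \frac{2^{32}n + 2n + 1}{3 \cdot 2^{32}} \right\rceil,$$

where we have used Theorem D2. Hence

$$q = \left\lceil \frac{n}{3} + \frac{2n + 1}{3 \cdot 2^{32}} \right\rceil.$$

The error term is nonpositive and greater than -1/3, so by Theorem D4 $q = \lceil n/3 \rceil$, which is the desired
result (Equation (1) on page 138).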


This establishes that the quotient is correct. That the remainder is correct follows easily from the fact that the
remainder must satisfy

$$n = qd + r,$$

the multiplication by 3 cannot overflow (because $-2^{31}/3 \le q \le (2^{31} - 1)/3$), and the subtract cannot overflow
because the result must be in the range -2 to +2.


The multiply immediate can be done with two add's, or a shift and an add, if either gives an improvement in
execution time.


On many present-day RISC computers, the quotient can be computed as shown above in nine or ten cycles, 
whereas the divide instruction might take 20 cycles or so. 
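
The divide-by-3 sequence can be tried out in C by modeling mulhs with a 64-bit product. The sketch below is illustrative: the names mulhs, div3, and rem3 are chosen here, and >> on a negative int64_t is assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the signed 64-bit product (models mulhs).
   Assumes >> on a negative int64_t is an arithmetic shift. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Sketch: truncating signed division by 3 with the magic number (2**32+2)/3. */
int32_t div3(int32_t n) {
    int32_t q = mulhs(0x55555556, n);      /* q = floor(M*n/2**32).      */
    q += (int32_t)((uint32_t)n >> 31);     /* Add 1 if n is negative.    */
    assert(n - q * 3 >= -2 && n - q * 3 <= 2);
    return q;
}

/* Remainder from r = n - q*3. */
int32_t rem3(int32_t n) {
    return n - div3(n) * 3;
}
```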


Divide by 5 


For division by 5, we would like to use the same code as for division by 3, except with a multiplier of
$(2^{32} + 4)/5$. Unfortunately, the error term is then too large; the result is off by 1 for about 1/5 of the values of
$n \ge 2^{30}$ in magnitude. However, we can use a multiplier of $(2^{33} + 3)/5$ and add a shift right signed
instruction. The code is


li    M,0x66666667   Load magic number, (2**33+3)/5.
mulhs q,M,n          q = floor(M*n/2**32).
shrsi q,q,1
shri  t,n,31         Add 1 to q if
add   q,q,t          n is negative.
muli  t,q,5          Compute remainder from
sub   r,n,t          r = n - q*5.


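Proof. The mulhs produces the leftmost 32 bits of the 64-bit product, and then the code shifts this right by
one position, signed (or "arithmetically"). This is equivalent to dividing the product by $2^{33}$ and then taking the
floor of the result. Thus, for $n \ge 0$ the code computes

$$q = \left\lfloor \frac{2^{33} + 3}{5} \cdot \frac{n}{2^{33}} \right\rfloor = \left\lfloor \frac{n}{5} + \frac{3n}{5 \cdot 2^{33}} \right\rfloor.$$

For $0 \le n < 2^{31}$, the error term $3n/(5 \cdot 2^{33})$ is nonnegative and less than 1/5, so by Theorem D4,
$q = \lfloor n/5 \rfloor$.


For n < 0 the above code computes

$$q = \left\lfloor \frac{2^{33} + 3}{5} \cdot \frac{n}{2^{33}} \right\rfloor + 1 = \left\lceil \frac{n}{5} + \frac{3n + 1}{5 \cdot 2^{33}} \right\rceil.$$

The error term is nonpositive and greater than -1/5, so $q = \lceil n/5 \rceil$.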

That the remainder is correct follows as in the case of division by 3. 

The multiply immediate can be done with a shift left of two and an add. 
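
A C rendering of the divide-by-5 sequence, under the same assumptions as any C model of mulhs (arithmetic right shift of negative values; the function names are illustrative), might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the signed 64-bit product (models mulhs).
   Assumes >> on negative values is an arithmetic shift. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Sketch: truncating signed division by 5 with the magic number (2**33+3)/5. */
int32_t div5(int32_t n) {
    int32_t q = mulhs(0x66666667, n);      /* q = floor(M*n/2**32).   */
    q = q >> 1;                            /* shrsi q,q,1             */
    q += (int32_t)((uint32_t)n >> 31);     /* Add 1 if n is negative. */
    assert(n - q * 5 >= -4 && n - q * 5 <= 4);
    return q;
}
```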

Divide by 7 

Dividing by 7 creates a new problem. Multipliers of $(2^{32} + 3)/7$ and $(2^{33} + 6)/7$ give error terms that are too
large. A multiplier of $(2^{34} + 5)/7$ would work, but it's too large to represent in a 32-bit signed word. We can
multiply by this large number by multiplying by $(2^{34} + 5)/7 - 2^{32}$ (a negative number), and then correcting the
product by inserting an add. The code is


li    M,0x92492493   Magic num, (2**34+5)/7 - 2**32.
mulhs q,M,n          q = floor(M*n/2**32).
add   q,q,n          q = floor(M*n/2**32) + n.
shrsi q,q,2          q = floor(q/4).
shri  t,n,31         Add 1 to q if
add   q,q,t          n is negative.
muli  t,q,7          Compute remainder from
sub   r,n,t          r = n - q*7.


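Proof. It is important to note that the instruction "add q,q,n" above cannot overflow. This is because q and
n have opposite signs, due to the multiplication by a negative number. Therefore, this "computer arithmetic"
addition is the same as real number addition. Hence for $n \ge 0$ the above code computes

$$q = \left\lfloor \frac{\left\lfloor \left( \frac{2^{34} + 5}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n}{4} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{2^{34}n + 5n - 7 \cdot 2^{32}n + 7 \cdot 2^{32}n}{7 \cdot 2^{32}} \right\rfloor}{4} \right\rfloor = \left\lfloor \frac{n}{7} + \frac{5n}{7 \cdot 2^{34}} \right\rfloor,$$

where we have used the corollary of Theorem D3.


For $0 \le n < 2^{31}$, the error term $5n/(7 \cdot 2^{34})$ is nonnegative and less than 1/7, so $q = \lfloor n/7 \rfloor$.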


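For n < 0, the above code computes

$$q = \left\lfloor \frac{\left\lfloor \left( \frac{2^{34} + 5}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n}{4} \right\rfloor + 1 = \left\lceil \frac{n}{7} + \frac{5n + 1}{7 \cdot 2^{34}} \right\rceil.$$

The error term is nonpositive and greater than -1/7, so $q = \lceil n/7 \rceil$.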


The multiply immediate can be done with a shift left of three and a subtract. 
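
The divide-by-7 sequence, with its corrective add, can be modeled the same way. In this sketch the magic number 0x92492493 is written as the negative decimal-equivalent constant -0x6DB6DB6D so that it is a valid int32_t literal; the function names are illustrative, and arithmetic right shifts of negative values are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the signed 64-bit product (models mulhs).
   Assumes >> on negative values is an arithmetic shift. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Sketch: truncating signed division by 7.
   M = (2**34+5)/7 - 2**32 = 0x92492493 as a signed word = -0x6DB6DB6D. */
int32_t div7(int32_t n) {
    const int32_t M = -0x6DB6DB6D;
    int32_t q = mulhs(M, n);               /* q = floor(M*n/2**32).      */
    q = q + n;                             /* Cannot overflow: opposite signs. */
    q = q >> 2;                            /* q = floor(q/4).            */
    q += (int32_t)((uint32_t)n >> 31);     /* Add 1 if n is negative.    */
    assert(n - q * 7 >= -6 && n - q * 7 <= 6);
    return q;
}
```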


10-4 Signed Division by Divisors ≥ 2


At this point you may wonder if other divisors present other problems. We see in this section that they do not;
the three examples given illustrate the only cases that arise (for $d \ge 2$).


Some of the proofs are a bit complicated, so to be cautious, the work is done in terms of a general word size W.


Given a word size $W \ge 3$ and a divisor d, $2 \le d < 2^{W-1}$, we wish to find the least integer m and integer p such
that


Equation 1a

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } 0 \le n < 2^{W-1}, \text{ and}$$


Equation 1b

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lceil \frac{n}{d} \right\rceil \quad \text{for } -2^{W-1} \le n \le -1,$$


with $0 \le m < 2^W$ and $p \ge W$.


The reason we want the least integer m is that a smaller multiplier may give a smaller shift amount (possibly
zero) or may yield code similar to the "divide by 5" example, rather than the "divide by 7" example. We must
have $m \le 2^W - 1$ so the code has no more instructions than that of the "divide by 7" example (that is, we can
handle a multiplier in the range $2^{W-1}$ to $2^W - 1$ by means of the add that was inserted in the "divide by 7"
example, but we would rather not deal with larger multipliers). We must have $p \ge W$ because the generated
code extracts the left half of the product mn, which is equivalent to shifting right W positions. Thus, the total
right shift is W or more positions.


There is a distinction between the multiplier m and the "magic number," denoted M. The magic number is the
value used in the multiply instruction. It is given by

$$M = \begin{cases} m, & \text{if } 0 \le m < 2^{W-1}, \\ m - 2^W, & \text{if } 2^{W-1} \le m < 2^W. \end{cases}$$

Because (1b) must hold for n = -d, $\lfloor -md/2^p \rfloor + 1 = -1$, which implies


Equation 2

$$\frac{md}{2^p} > 1.$$

Let $n_c$ be the largest (positive) value of n such that $\mathrm{rem}(n_c, d) = d - 1$. $n_c$ exists because one possibility is
$n_c = d - 1$. It can be calculated from
$n_c = \lfloor 2^{W-1}/d \rfloor d - 1 = 2^{W-1} - \mathrm{rem}(2^{W-1}, d) - 1$. $n_c$ is one of the highest d
admissible values of n, so


Equation 3a

$$2^{W-1} - d \le n_c \le 2^{W-1} - 1,$$


and clearly


Equation 3b

$$n_c \ge d - 1.$$


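Because (1a) must hold for $n = n_c$,

$$\left\lfloor \frac{m n_c}{2^p} \right\rfloor = \left\lfloor \frac{n_c}{d} \right\rfloor = \frac{n_c - (d - 1)}{d},$$

or

$$\frac{m n_c}{2^p} < \frac{n_c + 1}{d}.$$

Combining this with (2) gives


Equation 4

$$\frac{2^p}{d} < m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$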


Because m is to be the least integer satisfying (4), it is the next integer greater than $2^p/d$; that is,


Equation 5

$$m = \frac{2^p + d - \mathrm{rem}(2^p, d)}{d}.$$


Combining this with the right half of (4) and simplifying gives


Equation 6

$$2^p > n_c (d - \mathrm{rem}(2^p, d)).$$


The Algorithm 


Thus, the algorithm to find the magic number M and the shift amount s from d is to first compute $n_c$, and then
solve (6) for p by trying successively larger values. If p < W, set p = W (the theorem below shows that this
value of p also satisfies (6)). When the smallest $p \ge W$ satisfying (6) is found, m is calculated from (5). This is
the smallest possible value of m, because we found the smallest acceptable p, and from (4) clearly smaller
values of p yield smaller values of m. Finally, s = p - W and M is simply a reinterpretation of m as a signed
integer (which is how the mulhs instruction interprets it).


Forcing p to be at least W is justified by the following:


Theorem DC1. If (6) is true for some value of p, then it is true for all larger values of p.


Proof. Suppose (6) is true for $p = p_0$. Multiplying (6) by 2 gives

$$2^{p_0 + 1} > n_c (2d - 2\,\mathrm{rem}(2^{p_0}, d)).$$

From Theorem D5, $\mathrm{rem}(2^{p_0 + 1}, d) \ge 2\,\mathrm{rem}(2^{p_0}, d) - d$. Combining gives

$$2^{p_0 + 1} > n_c (2d - (\mathrm{rem}(2^{p_0 + 1}, d) + d)), \text{ or}$$

$$2^{p_0 + 1} > n_c (d - \mathrm{rem}(2^{p_0 + 1}, d)).$$

Therefore, (6) is true for $p = p_0 + 1$, and hence for all larger values.


Thus, one could solve (6) by a binary search, although a simple linear search (starting with p = W) is probably 
preferable, because usually d is small, and small values of d give small values of p. 
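
For W = 32 the search just described is small enough to carry out directly with 64-bit arithmetic, since p never exceeds 2W - 2 = 62. The sketch below follows (6) and (5) literally for positive divisors; the type magic_t and the function magic_direct are hypothetical names introduced for illustration, and this is not the doubleword-free method developed later in the chapter.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t M;   /* Magic number: m reinterpreted as a signed word. */
    int     s;   /* Shift amount, s = p - 32.                       */
} magic_t;

/* Sketch: direct evaluation of (6) and (5) for W = 32, 2 <= d < 2**31. */
magic_t magic_direct(uint32_t d) {
    assert(d >= 2 && d <= 0x7FFFFFFFu);
    const uint64_t two31 = (uint64_t)1 << 31;
    uint64_t nc = two31 - two31 % d - 1;          /* n_c */
    int p = 32;                                   /* Force p >= W. */
    uint64_t two_p = (uint64_t)1 << p;
    while (two_p <= nc * (d - two_p % d)) {       /* Until (6) holds. */
        p++;
        two_p <<= 1;
    }
    uint64_t m = (two_p + d - two_p % d) / d;     /* Equation (5). */
    magic_t result;
    result.M = (m < two31) ? (int32_t)m
                           : (int32_t)((int64_t)m - ((int64_t)1 << 32));
    result.s = p - 32;
    return result;
}
```

For d = 3, 5, and 7 this reproduces the magic numbers and shift amounts used in the three worked examples of Section 10-3.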


Proof That the Algorithm Is Feasible 


We must show that (6) always has a solution and that $0 \le m < 2^W$. (It is not necessary to show that $p \ge W$,
because that is forced.)


We show that (6) always has a solution by getting an upper bound on p. As a matter of general interest, we also
derive a lower bound under the assumption that p is not forced to be at least W. To get these bounds on p,
observe that for any positive integer x, there is a power of 2 greater than x and less than or equal to 2x. Hence
from (6),

$$n_c (d - \mathrm{rem}(2^p, d)) < 2^p \le 2 n_c (d - \mathrm{rem}(2^p, d)).$$

Because $0 \le \mathrm{rem}(2^p, d) \le d - 1$,


Equation 7

$$n_c + 1 \le 2^p \le 2 n_c d.$$


From (3a) and (3b), $n_c \ge \max(2^{W-1} - d, d - 1)$. The lines $f_1(d) = 2^{W-1} - d$ and $f_2(d) = d - 1$ cross at
$d = (2^{W-1} + 1)/2$. Hence $n_c \ge (2^{W-1} - 1)/2$. Because $n_c$ is an integer, $n_c \ge 2^{W-2}$. Because
$n_c, d \le 2^{W-1} - 1$, (7) becomes

$$2^{W-2} + 1 \le 2^p \le 2(2^{W-1} - 1)^2,$$

or


Equation 8

$$W - 1 \le p \le 2W - 2.$$


The lower bound p = W - 1 can occur (e.g., for W = 32, d = 3), but in that case we set p = W. 


If p is not forced to equal W, then from (4) and (7),

$$\frac{n_c + 1}{d} < m < \frac{2 n_c d}{d} \cdot \frac{n_c + 1}{n_c}.$$

Using (3b) gives

$$\frac{d - 1 + 1}{d} < m < 2(n_c + 1).$$

Because $n_c \le 2^{W-1} - 1$ (3a),

$$2 \le m \le 2^W - 1.$$


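If p is forced to equal W, then from (4),

$$\frac{2^W}{d} < m < \frac{2^W}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $2 \le d \le 2^{W-1} - 1$ and $n_c \ge 2^{W-2}$,

$$\frac{2^W}{2^{W-1} - 1} < m < \frac{2^W}{2} \cdot \frac{2^{W-2} + 1}{2^{W-2}}, \text{ or}$$

$$3 \le m \le 2^{W-1} + 1.$$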


Hence in either case m is within limits for the code schema illustrated by the "divide by 7" example.


Proof That the Product Is Correct


We must show that if p and m are calculated from (6) and (5), then equations (1a) and (1b) are satisfied.


Equation (5) and inequality (6) are easily seen to imply (4). (In the case that p is forced to be equal to W, (6)
still holds, as shown by Theorem DC1.) In what follows, we consider separately the following five ranges of
values of n:

$0 \le n \le n_c$,
$n_c + 1 \le n \le n_c + d - 1$,
$-n_c \le n \le -1$,
$-n_c - d + 1 \le n \le -n_c - 1$, and
$n = -n_c - d$.


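From (4), because m is an integer,

$$\frac{2^p}{d} < m \le \frac{2^p (n_c + 1) - 1}{d n_c}.$$

Multiplying by $n/2^p$, for $n \ge 0$ this becomes

$$\frac{n}{d} \le \frac{mn}{2^p} \le \frac{2^p n (n_c + 1) - n}{2^p d n_c}, \text{ so that}$$

$$\left\lfloor \frac{n}{d} \right\rfloor \le \left\lfloor \frac{mn}{2^p} \right\rfloor \le \left\lfloor \frac{n}{d} + \frac{(2^p - 1) n}{2^p d n_c} \right\rfloor.$$

For $0 \le n \le n_c$, $0 \le (2^p - 1)n/(2^p d n_c) < 1/d$, so by Theorem D4,

$$\left\lfloor \frac{n}{d} + \frac{(2^p - 1) n}{2^p d n_c} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor.$$

Hence (1a) is satisfied in this case ($0 \le n \le n_c$).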


For $n > n_c$, n is limited to the range


Equation 9

$$n_c + 1 \le n \le n_c + d - 1,$$

because $n \ge n_c + d$ contradicts the choice of $n_c$ as the largest value of n such that $\mathrm{rem}(n_c, d) = d - 1$
(alternatively, from (3a), $n \ge n_c + d$ implies $n \ge 2^{W-1}$). From (4), for $n \ge 0$,

$$\frac{n}{d} < \frac{mn}{2^p} < \frac{n}{d} \cdot \frac{n_c + 1}{n_c}.$$

By elementary algebra, this can be written


Equation 10

$$\frac{n}{d} < \frac{mn}{2^p} < \frac{n_c + 1}{d} + \frac{(n - n_c)(n_c + 1)}{d n_c}.$$


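From (9), $1 \le n - n_c \le d - 1$, so

$$0 < \frac{(n - n_c)(n_c + 1)}{d n_c} \le \frac{d - 1}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $n_c \ge d - 1$ (by (3b)) and $(n_c + 1)/n_c$ has its maximum when $n_c$ has its minimum,

$$0 < \frac{(n - n_c)(n_c + 1)}{d n_c} \le \frac{d - 1}{d} \cdot \frac{d - 1 + 1}{d - 1} = 1.$$

In (10), the term $(n_c + 1)/d$ is an integer. The term $(n - n_c)(n_c + 1)/(d n_c)$ is less than or equal to 1. Therefore, (10)
becomes

$$\left\lfloor \frac{n}{d} \right\rfloor \le \left\lfloor \frac{mn}{2^p} \right\rfloor \le \frac{n_c + 1}{d}.$$

For all n in the range (9), $\lfloor n/d \rfloor = (n_c + 1)/d$.


Hence (1a) is satisfied in this case ($n_c + 1 \le n \le n_c + d - 1$).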


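For n < 0, from (4) we have, because m is an integer,

$$\frac{2^p + 1}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$

Multiplying by $n/2^p$, for n < 0 this becomes

$$\frac{n}{d} \cdot \frac{n_c + 1}{n_c} < \frac{mn}{2^p} \le \frac{n}{d} \cdot \frac{2^p + 1}{2^p},$$

or

$$\left\lfloor \frac{n}{d} \cdot \frac{n_c + 1}{n_c} \right\rfloor + 1 \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lfloor \frac{n}{d} \cdot \frac{2^p + 1}{2^p} \right\rfloor + 1.$$

Using Theorem D2 gives

$$\left\lceil \frac{n(n_c + 1) - d n_c + 1}{d n_c} \right\rceil + 1 \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n(2^p + 1) - 2^p d + 1}{2^p d} \right\rceil + 1,$$

$$\left\lceil \frac{n(n_c + 1) + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n(2^p + 1) + 1}{2^p d} \right\rceil.$$

Because $n + 1 \le 0$, the right inequality can be weakened, giving


Equation 11

$$\left\lceil \frac{n}{d} + \frac{n + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$

For $-n_c \le n \le -1$,

$$-\frac{1}{d} < \frac{n + 1}{d n_c} \le 0.$$

Hence by Theorem D4,

$$\left\lceil \frac{n}{d} + \frac{n + 1}{d n_c} \right\rceil = \left\lceil \frac{n}{d} \right\rceil,$$

so that (1b) is satisfied in this case ($-n_c \le n \le -1$).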


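For $n < -n_c$, n is limited to the range


Equation 12

$$-n_c - d \le n \le -n_c - 1.$$

(From (3a), $n < -n_c - d$ implies that $n < -2^{W-1}$, which is impossible.) Performing elementary algebraic
manipulation of the left comparand of (11) gives


Equation 13

$$\left\lceil \frac{-n_c - 1}{d} + \frac{(n + n_c)(n_c + 1) + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$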


£Z S 


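For $-n_c - d + 1 \le n \le -n_c - 1$,

$$\frac{(-d + 1)(n_c + 1) + 1}{d n_c} \le \frac{(n + n_c)(n_c + 1) + 1}{d n_c} \le \frac{-(n_c + 1) + 1}{d n_c} = -\frac{1}{d}.$$

The ratio $(n_c + 1)/n_c$ is a maximum when $n_c$ is a minimum; that is, $n_c = d - 1$. Therefore,

$$\frac{(-d + 1)(d - 1 + 1)}{d(d - 1)} + \frac{1}{d n_c} \le \frac{(n + n_c)(n_c + 1) + 1}{d n_c} < 0, \text{ or}$$

$$-1 < \frac{(n + n_c)(n_c + 1) + 1}{d n_c} < 0.$$

From (13), because $(-n_c - 1)/d$ is an integer and the quantity added to it is between 0 and -1,

$$\frac{-n_c - 1}{d} \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$

For n in the range $-n_c - d + 1 \le n \le -n_c - 1$,

$$\left\lceil \frac{n}{d} \right\rceil = \frac{-n_c - 1}{d}.$$

Hence $\lfloor mn/2^p \rfloor + 1 = \lceil n/d \rceil$; that is, (1b) is satisfied.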


< 


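The last case, $n = -n_c - d$, can occur only for certain values of d. From (3a), $-n_c - d \le -2^{W-1}$, so if n takes on
this value, we must have $n = -n_c - d = -2^{W-1}$, and hence $n_c = 2^{W-1} - d$. Therefore,
$\mathrm{rem}(2^{W-1}, d) = \mathrm{rem}(n_c + d, d) = d - 1$ (that is, d divides $2^{W-1} + 1$).


For this case ($n = -n_c - d$), (6) has the solution p = W - 1 (the smallest possible value of p), because for p = W - 1,

$$n_c (d - \mathrm{rem}(2^p, d)) = (2^{W-1} - d)(d - \mathrm{rem}(2^{W-1}, d))$$

$$= (2^{W-1} - d)(d - (d - 1)) = 2^{W-1} - d < 2^{W-1} = 2^p.$$

Then from (5),

$$m = \frac{2^{W-1} + d - \mathrm{rem}(2^{W-1}, d)}{d} = \frac{2^{W-1} + d - (d - 1)}{d} = \frac{2^{W-1} + 1}{d}.$$

Therefore,

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lfloor \frac{2^{W-1} + 1}{d} \cdot \frac{-2^{W-1}}{2^{W-1}} \right\rfloor + 1 = \left\lfloor \frac{-2^{W-1} - 1}{d} \right\rfloor + 1$$

$$= \left\lceil \frac{-2^{W-1} - d}{d} \right\rceil + 1 = \left\lceil \frac{-2^{W-1}}{d} \right\rceil = \left\lceil \frac{n}{d} \right\rceil,$$

so that (1b) is satisfied.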


This completes the proof that if m and p are calculated from (5) and (6), then Equations (1a) and (1b) hold for 
all admissible values of n. 


10-5 Signed Division by Divisors ≤ -2


Because signed integer division satisfies $n \div (-d) = -(n \div d)$, it is adequate to generate code for $n \div |d|$ and follow
it with an instruction to negate the quotient. (This does not give the correct result for $d = -2^{W-1}$, but for this and
other negative powers of 2, you can use the code in Section 10-1, "Signed Division by a Known Power of 2,"
on page 155, followed by a negating instruction.) It will not do to negate the dividend, because of the
possibility that it is the maximum negative number.


It is possible, however, to avoid the negating instruction. The scheme is to compute

$$q = \left\lfloor \frac{mn}{2^p} \right\rfloor \quad \text{if } n \le 0, \text{ and}$$

$$q = \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \quad \text{if } n > 0.$$

Adding 1 if n > 0, however, is awkward (because one cannot simply use the sign bit of n), so the code will
instead add 1 if q < 0. This is equivalent because the multiplier m is negative (as will be seen).


The code to be generated is illustrated below for the case W = 32, d = -7. 


li    M,0x6DB6DB6D   Magic num, -(2**34+5)/7 + 2**32.
mulhs q,M,n          q = floor(M*n/2**32).
sub   q,q,n          q = floor(M*n/2**32) - n.
shrsi q,q,2          q = floor(q/4).
shri  t,q,31         Add 1 to q if
add   q,q,t          q is negative (n is positive).
muli  t,q,-7         Compute remainder from
sub   r,n,t          r = n - q*(-7).


This code is the same as that for division by +7, except that it uses the negative of the multiplier for +7, and a
sub rather than an add after the multiply, and the shri of 31 must use q rather than n, as discussed above.
(The case of d = +7 could also use q here, but there would be less parallelism in the code.) The subtract will not
overflow because the operands have the same sign. This scheme, however, does not always work! Although the
code above for W = 32, d = -7 is correct, the analogous alteration of the "divide by 3" code to produce code to
divide by -3 does not give the correct result for W = 32, $n = -2^{31}$.
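
Both observations can be checked with a small C model. The sketch below is illustrative only: the function names are hypothetical, mulhs is modeled with a 64-bit product, and arithmetic right shifts of negative values are assumed. The second function is the naive alteration of the "divide by 3" code and is included to exhibit the failure, not as a recommended method.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the signed 64-bit product (models mulhs).
   Assumes >> on negative values is an arithmetic shift. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Sketch: truncating signed division by -7, following the code above. */
int32_t div_neg7(int32_t n) {
    int32_t q = mulhs(0x6DB6DB6D, n);      /* q = floor(M*n/2**32).     */
    q = q - n;                             /* Operands have the same sign. */
    q = q >> 2;                            /* q = floor(q/4).           */
    q += (int32_t)((uint32_t)q >> 31);     /* Add 1 if q is negative.   */
    return q;
}

/* Naive alteration of the divide-by-3 code: negate the multiplier and
   test the sign of q. Wrong for n = -2**31, as the text warns. */
int32_t div_neg3_naive(int32_t n) {
    int32_t q = mulhs(-0x55555556, n);
    q += (int32_t)((uint32_t)q >> 31);
    return q;
}
```

For n = -2**31 the naive divide-by-(-3) routine returns 715827883, whereas the truncated quotient is 715827882.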


Let us look at the situation more closely. 


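Given a word size $W \ge 3$ and a divisor d, $-2^{W-1} \le d \le -2$, we wish to find the least (in absolute value) integer
m and integer p such that


Equation 14a

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } -2^{W-1} \le n \le 0, \text{ and}$$


Equation 14b

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lceil \frac{n}{d} \right\rceil \quad \text{for } 1 \le n < 2^{W-1},$$


with $-2^W \le m \le 0$ and $p \ge W$.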


Proceeding similarly to the case of division by a positive divisor, let $n_c$ be the most negative value of n such
that $n_c = kd + 1$ for some integer k. $n_c$ exists because one possibility is $n_c = d + 1$. It can be calculated from
$n_c = \lfloor (-2^{W-1} - 1)/d \rfloor d + 1 = -2^{W-1} + \mathrm{rem}(2^{W-1} + 1, d)$. $n_c$ is one of the least |d| admissible
values of n, so


Equation 15a

$$-2^{W-1} \le n_c \le -2^{W-1} - d - 1,$$


and clearly


Equation 15b

$$n_c \le d + 1.$$


Because (14b) must hold for n = -d, and (14a) must hold for $n = n_c$, we obtain, analogous to (4),


Equation 16

$$\frac{2^p}{d} \cdot \frac{n_c - 1}{n_c} < m < \frac{2^p}{d}.$$


Because m is to be the greatest integer satisfying (16), it is the next integer less than $2^p/d$; that is,


Equation 17

$$m = \frac{2^p - d - \mathrm{rem}(2^p, d)}{d}.$$


Combining this with the left half of (16) and simplifying gives


Equation 18

$$2^p > n_c (d + \mathrm{rem}(2^p, d)).$$


The proof that the algorithm suggested by (17) and (18) is feasible and that the product is correct is similar to
that for a positive divisor, and will not be repeated. A difficulty arises, however, in trying to prove that
$-2^W \le m \le 0$. To prove this, consider separately the cases in which d is the negative of a power of 2, or some other
number. For $d = -2^k$ it is easy to show that $n_c = -2^{W-1} + 1$, $p = W + k - 1$, and $m = -2^{W-1} - 1$ (which is within
range). For d not of the form $-2^k$ it is straightforward to alter the earlier proof.


For Which Divisors Is m(-d) ≠ -m(d)?


By m(d) we mean the multiplier corresponding to a divisor d. If m(-d) = -m(d), code for division by a negative
divisor can be generated by calculating the multiplier for |d|, negating it, and then generating code similar to
that of the "divide by -7" case illustrated above.


By comparing (18) with (6) and (17) with (5), it can be seen that if the value of $n_c$ for -d is the negative of that
for d, then m(-d) = -m(d). Hence m(-d) ≠ -m(d) can occur only when the value of $n_c$ calculated for the negative
divisor is the maximum negative number, $-2^{W-1}$. Such divisors are the negatives of the factors of $2^{W-1} + 1$.
These numbers are fairly rare, as illustrated by the factorings below (obtained from Scratchpad).

$2^{15} + 1 = 3^2 \cdot 11 \cdot 331$

$2^{31} + 1 = 3 \cdot 715,827,883$

$2^{63} + 1 = 3^3 \cdot 19 \cdot 43 \cdot 5419 \cdot 77,158,673,929$


For all these factors, m(-d) ≠ -m(d). Proof sketch: For d > 0 we have $n_c = 2^{W-1} - d$. Because
$\mathrm{rem}(2^{W-1}, d) = d - 1$, (6) is satisfied by p = W - 1 and hence also by p = W. For d < 0, however, we have
$n_c = -2^{W-1}$ and $\mathrm{rem}(2^{W-1}, d) = |d| - 1$. Hence (18) is not satisfied for p = W - 1 or for p = W, so p > W.


10-6 Incorporation into a Compiler 


For a compiler to change division by a constant into a multiplication, it must compute the magic number M and 
the shift amount s, given a divisor d. The straightforward computation is to evaluate (6) or (18) for p = W, W + 
1, ... until it is satisfied. Then, m is calculated from (5) or (17). M is simply a reinterpretation of m as a signed 
integer, and s = p - W.


The scheme described below handles positive and negative d with only a little extra code, and it avoids 
doubleword arithmetic. 


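Recall that $n_c$ is given by

$$n_c = \begin{cases} 2^{W-1} - \mathrm{rem}(2^{W-1}, d) - 1, & \text{if } d > 0, \\ -2^{W-1} + \mathrm{rem}(2^{W-1} + 1, d), & \text{if } d < 0. \end{cases}$$

Hence $|n_c|$ can be computed from

$$t = 2^{W-1} + \begin{cases} 0, & \text{if } d > 0, \\ 1, & \text{if } d < 0, \end{cases}$$

$$|n_c| = t - 1 - \mathrm{rem}(t, |d|).$$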


The remainder must be evaluated using unsigned division, because of the magnitude of the arguments. We have 
written rem(t, |d|) rather than the equivalent rem(t, d), to emphasize that the program must deal with two 
positive (and unsigned) arguments. 
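
For W = 32 this computation fits comfortably in unsigned 32-bit arithmetic. The following is a minimal sketch; the function name abs_nc is hypothetical, and the divisor is assumed to satisfy 2 <= |d| <= 2**31.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: |n_c| for W = 32, computed with unsigned arithmetic only. */
uint32_t abs_nc(int32_t d) {
    uint32_t ad = (d < 0) ? 0u - (uint32_t)d : (uint32_t)d;  /* |d| */
    assert(ad >= 2);
    uint32_t t = 0x80000000u + ((uint32_t)d >> 31);  /* 2**31, plus 1 if d < 0. */
    return t - 1 - t % ad;                           /* Unsigned remainder.     */
}
```

Note that for d = -3, a factor of 2**31 + 1, the result is 2**31, which corresponds to n_c being the maximum negative number.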


From (6) and (18), p can be calculated from

Equation 19

$$2^p > |n_c|\,(|d| - \mathrm{rem}(2^p, |d|)),$$

and then |m| can be calculated from (cf. (5) and (17)):

Equation 20

$$|m| = \frac{2^p + |d| - \mathrm{rem}(2^p, |d|)}{|d|}.$$


Direct evaluation of $\mathrm{rem}(2^p, |d|)$ in (19) requires "long division" (dividing a 2W-bit dividend by a W-bit divisor, giving a W-bit quotient and remainder), and in fact it must be unsigned long division. There is a way to solve (19), and in fact to do all the calculations, that avoids long division and can easily be implemented in a conventional HLL using only W-bit arithmetic. We do, however, need unsigned division and unsigned comparisons.

We can calculate $\mathrm{rem}(2^p, |d|)$ incrementally, by initializing two variables q and r to the quotient and remainder of $2^p$ divided by |d| with p = W - 1, and then updating q and r as p increases.


As the search progresses (that is, when p is incremented by 1), q and r are updated from (see Theorem D5(a))

   q = 2*q;
   r = 2*r;
   if (r >= abs(d)) {
      q = q + 1;
      r = r - abs(d);}
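The update step above can be exercised in isolation. The following is a minimal sketch for W = 32 that compares the incrementally maintained q and r against 64-bit division; `step_qr` and `check_incremental` are hypothetical helper names, and q is compared modulo $2^{32}$ because it is kept in a 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* One search step: given q = floor(2**p/ad) and r = rem(2**p, ad),
   overwrite them with the values for p + 1. Requires ad <= 2**31,
   so that 2*r fits in an unsigned 32-bit word. */
static void step_qr(uint32_t *q, uint32_t *r, uint32_t ad) {
    *q = 2 * *q;
    *r = 2 * *r;
    if (*r >= ad) {          /* unsigned comparison */
        *q = *q + 1;
        *r = *r - ad;
    }
}

/* Returns 1 if `steps` updates starting from p = 31 agree with
   64-bit division, 0 otherwise. Requires steps <= 31. */
static int check_incremental(uint32_t ad, int steps) {
    assert(ad >= 1 && ad <= 0x80000000u && steps <= 31);
    uint32_t q = 0x80000000u / ad;
    uint32_t r = 0x80000000u % ad;
    for (int p = 32; p < 32 + steps; p++) {
        step_qr(&q, &r, ad);
        uint64_t pow2 = (uint64_t)1 << p;
        if (r != (uint32_t)(pow2 % ad)) return 0;
        if (q != (uint32_t)(pow2 / ad)) return 0;   /* modulo 2**32 */
    }
    return 1;
}
```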


The left half of inequality (4) and the right half of (16), together with the bounds proved for m, imply that $q = \lfloor 2^p/|d| \rfloor < 2^W$, so q is representable as a W-bit unsigned integer. Also, $0 \le r < |d|$, so r is representable as a W-bit signed or unsigned integer. (Caution: The intermediate result 2r can exceed $2^{W-1} - 1$, so r should be unsigned and the comparison above should also be unsigned.)

Next, calculate $\delta = |d| - r$. Both terms of the subtraction are representable as W-bit unsigned integers, and the result is also ($1 \le \delta \le |d|$), so there is no difficulty here.

To avoid the long multiplication of (19), rewrite it as

$$\frac{2^p}{|n_c|} > \delta.$$

The quantity $2^p/|n_c|$ is representable as a W-bit unsigned integer (similarly to (7), from (19) it can be shown that $2^p \le 2|n_c| \cdot |d|$ and, for $d = -2^{W-1}$, $n_c = -2^{W-1} + 1$ and $p = 2W - 2$, so that $2^p/|n_c| = 2^{2W-2}/(2^{W-1} - 1) < 2^W$ for $W \ge 3$). Also, it is easily calculated incrementally (as p increases) in the same manner as for $\mathrm{rem}(2^p, |d|)$. The comparison should be unsigned, for the case $2^p/|n_c| \ge 2^{W-1}$ (which can occur, for large d).

To compute m, we need not evaluate (20) directly (which would require long division). Observe that

$$\frac{2^p + |d| - \mathrm{rem}(2^p, |d|)}{|d|} = \left\lfloor \frac{2^p}{|d|} \right\rfloor + 1 = q + 1.$$

The loop closure test $2^p/|n_c| > \delta$ is awkward to evaluate. The quantity $2^p/|n_c|$ is available only in the form of a quotient $q_1$ and a remainder $r_1$. $2^p/|n_c|$ may or may not be an integer (it is an integer only for $d = 2^{W-2} + 1$ and a few negative values of d). The test $2^p/|n_c| \le \delta$ may be coded as

$$q_1 < \delta \;\mid\; (q_1 = \delta \;\&\; r_1 = 0).$$


The complete procedure for computing M and s from d is shown in Figure 10-1, coded in C, for W = 32. There are a few places where overflow can occur, but the correct result is obtained if overflow is ignored.

Figure 10-1 Computing the magic number for signed division.

struct ms {int M;          // Magic number
           int s;};        // and shift amount.

struct ms magic(int d) {   // Must have 2 <= d <= 2**31-1
                           // or   -2**31 <= d <= -2.
   int p;
   unsigned ad, anc, delta, q1, r1, q2, r2, t;
   const unsigned two31 = 0x80000000;     // 2**31.
   struct ms mag;

   ad = abs(d);
   t = two31 + ((unsigned)d >> 31);
   anc = t - 1 - t%ad;     // Absolute value of nc.
   p = 31;                 // Init. p.
   q1 = two31/anc;         // Init. q1 = 2**p/|nc|.
   r1 = two31 - q1*anc;    // Init. r1 = rem(2**p, |nc|).
   q2 = two31/ad;          // Init. q2 = 2**p/|d|.
   r2 = two31 - q2*ad;     // Init. r2 = rem(2**p, |d|).
   do {
      p = p + 1;
      q1 = 2*q1;           // Update q1 = 2**p/|nc|.
      r1 = 2*r1;           // Update r1 = rem(2**p, |nc|).
      if (r1 >= anc) {     // (Must be an unsigned
         q1 = q1 + 1;      // comparison here).
         r1 = r1 - anc;}
      q2 = 2*q2;           // Update q2 = 2**p/|d|.
      r2 = 2*r2;           // Update r2 = rem(2**p, |d|).
      if (r2 >= ad) {      // (Must be an unsigned
         q2 = q2 + 1;      // comparison here).
         r2 = r2 - ad;}
      delta = ad - r2;
   } while (q1 < delta || (q1 == delta && r1 == 0));

   mag.M = q2 + 1;
   if (d < 0) mag.M = -mag.M; // Magic number and
   mag.s = p - 32;            // shift amount to return.
   return mag;
}


To use the results of this program, the compiler should generate the li and mulhs instructions, generate the add if d > 0 and M < 0, or the sub if d < 0 and M > 0, and generate the shrsi if s > 0. Then, the shri and final add must be generated.
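The instruction sequence just described can be modeled in C as a rough sketch. It assumes that `>>` on a negative signed value is an arithmetic shift (true of the common compilers, but implementation-defined in C), and that M and s come from a routine such as the one in Figure 10-1. The names `mulhs` and `div_by_magic` are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit signed product (models mulhs).
   Assumes arithmetic right shift of negative values. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Models the generated code for n/d, given the magic number M and
   shift amount s for the constant divisor d. */
static int32_t div_by_magic(int32_t n, int32_t d, int32_t M, int s) {
    assert(d >= 2 || d <= -2);
    int32_t q = mulhs(M, n);             /* li, mulhs */
    if (d > 0 && M < 0) q = q + n;       /* add */
    if (d < 0 && M > 0) q = q - n;       /* sub */
    q = q >> s;                          /* shrsi (omitted if s = 0) */
    q = q + (int32_t)((uint32_t)q >> 31);/* shri, add: add 1 if q < 0 */
    return q;
}
```

With the pair (0x92492493, 2) for d = 7, the add is generated; with (0x55555555, 1) for d = -3, the sub is generated.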


For W = 32, handling a negative divisor may be avoided by simply returning a precomputed result for d = -3 and d = -715,827,883, and using m(-d) = -m(d) for other negative divisors. However, that program would not be significantly shorter, if at all, than the one given in Figure 10-1.


10-7 Miscellaneous Topics 


Theorem DC2. The least multiplier m is odd if p is not forced to equal W. 


Proof. Assume that Equations (1a) and (1b) are satisfied with least (not forced) integer p, and m even. Then 
clearly m could be divided by 2 and p could be decreased by 1, and (1a) and (1b) would still be satisfied. This 
contradicts the assumption that p is minimal. 


Uniqueness 


The magic number for a given divisor is sometimes unique (e.g., for W = 32, d = 7), but often it is not. In fact, experimentation indicates that it is usually not unique. For example, for W = 32, d = 6, there are four magic numbers:

   M =    715,827,883  ((2**32 + 2)/6),          s = 0
   M =  1,431,655,766  ((2**32 + 2)/3),          s = 1
   M = -1,431,655,765  ((2**33 + 1)/3 - 2**32),  s = 2
   M = -1,431,655,764  ((2**33 + 4)/3 - 2**32),  s = 2.


However, there is the following uniqueness property: 


Theorem DC3. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not 
forced to equal W. 


Proof. First consider the case d > 0. The difference between the upper and lower limits of inequality (4) is $2^p/(d\,n_c)$. We have already proved (7) that if p is minimal, then $2^p/(d\,n_c) \le 2$. Therefore, there can be at most two values of m satisfying (4). Let m be the smaller of these values, given by (5); then m + 1 is the other.

Let $p_0$ be the least value of p for which m + 1 satisfies the right half of (4) ($p_0$ is not forced to equal W). Then

$$\frac{2^{p_0} + d - \mathrm{rem}(2^{p_0}, d)}{d} + 1 < \frac{2^{p_0}}{d} \cdot \frac{n_c + 1}{n_c}.$$

This simplifies to

$$2^{p_0} > n_c\,(2d - \mathrm{rem}(2^{p_0}, d)).$$

Dividing by 2 gives

$$2^{p_0 - 1} > n_c\left(d - \tfrac{1}{2}\,\mathrm{rem}(2^{p_0}, d)\right).$$

Because $\mathrm{rem}(2^{p_0}, d) \le 2\,\mathrm{rem}(2^{p_0 - 1}, d)$ (by Theorem D5 on page 140),

$$2^{p_0 - 1} > n_c\,(d - \mathrm{rem}(2^{p_0 - 1}, d)),$$

contradicting the assumption that $p_0$ is minimal.

The proof for d < 0 is similar and will not be given.
The Divisors with the Best Programs 


The program for d = 3, W = 32 is particularly short, because there is no add or shrsi after the mulhs. What other divisors have this short program?

We consider only positive divisors. We wish to find integers m and p that satisfy equations (1a) and (1b), and for which p = W and $0 \le m \le 2^{W-1} - 1$. Because any integers m and p that satisfy equations (1a) and (1b) must also satisfy (4), it suffices to find those divisors d for which (4) has a solution with p = W and $0 \le m < 2^{W-1}$. All solutions of (4) with p = W are given by

$$m = \frac{2^W + kd - \mathrm{rem}(2^W, d)}{d}, \quad k = 1, 2, 3, \ldots$$

Combining this with the right half of (4) and simplifying gives

Equation 21

$$\mathrm{rem}(2^W, d) > kd - \frac{2^W}{n_c}.$$

The weakest restriction on $\mathrm{rem}(2^W, d)$ is with k = 1 and $n_c$ at its minimal value of $2^{W-2}$. Hence we must have

$$\mathrm{rem}(2^W, d) > d - 4;$$

that is, d divides $2^W + 1$, $2^W + 2$, or $2^W + 3$.
Now let us see which of these factors actually have optimal programs. 
If d divides $2^W + 1$, then $\mathrm{rem}(2^W, d) = d - 1$. Then a solution of (6) is p = W, because the inequality becomes

$$2^W > n_c\,(d - (d - 1)) = n_c,$$

which is obviously true, because $n_c < 2^{W-1}$. Then in the calculation of m we have

$$m = \frac{2^W + d - (d - 1)}{d} = \frac{2^W + 1}{d},$$

which is less than $2^{W-1}$ for $d \ge 3$ ($d \ne 2$ because d divides $2^W + 1$). Hence all the factors of $2^W + 1$ have optimal programs.

Similarly, if d divides $2^W + 2$, then $\mathrm{rem}(2^W, d) = d - 2$. Again, a solution of (6) is p = W, because the inequality becomes

$$2^W > n_c\,(d - (d - 2)) = 2n_c,$$

which is obviously true. Then in the calculation of m we have

$$m = \frac{2^W + d - (d - 2)}{d} = \frac{2^W + 2}{d},$$

which exceeds $2^{W-1} - 1$ for d = 2, but which is less than or equal to $2^{W-1} - 1$ for $W \ge 3$, $d \ge 3$ (the case W = 3 and d = 3 does not occur, because 3 is not a factor of $2^3 + 2 = 10$). Hence all factors of $2^W + 2$, except for 2 and the cofactor of 2, have optimal programs. (The cofactor of 2 is $(2^W + 2)/2$, which is not representable as a W-bit signed integer.)


If d divides $2^W + 3$, the following argument shows that d does not have an optimal program. Because $\mathrm{rem}(2^W, d) = d - 3$, inequality (21) implies that we must have

$$n_c < \frac{2^W}{kd - d + 3}$$

for some k = 1, 2, 3, .... The weakest restriction is with k = 1, so we must have $n_c < 2^W/3$.

From (3a), $n_c \ge 2^{W-1} - d$, or $d \ge 2^{W-1} - n_c$. Hence it is necessary that

$$d > 2^{W-1} - \frac{2^W}{3} = \frac{2^W}{6}.$$

Also, since 2, 3, and 4 do not divide $2^W + 3$, the smallest possible factor of $2^W + 3$ is 5. Hence the largest possible factor is $(2^W + 3)/5$. Thus, if d divides $2^W + 3$ and d has an optimal program, it is necessary that

$$\frac{2^W}{6} < d \le \frac{2^W + 3}{5}.$$

Taking reciprocals of this with respect to $2^W + 3$ shows that the cofactor of d, $(2^W + 3)/d$, has the limits

$$5 \le \frac{2^W + 3}{d} < \frac{(2^W + 3) \cdot 6}{2^W} = 6 + \frac{18}{2^W}.$$

For $W \ge 5$, this implies that the only possible cofactors are 5 and 6. For W < 5, it is easily verified that there are no factors of $2^W + 3$. Because 6 cannot be a factor of $2^W + 3$, the only possibility is 5. Therefore, the only possible factor of $2^W + 3$ that might have an optimal program is $(2^W + 3)/5$.

For $d = (2^W + 3)/5$,

$$n_c = \left\lfloor \frac{2^{W-1}}{(2^W + 3)/5} \right\rfloor \cdot \frac{2^W + 3}{5} - 1.$$

For $W \ge 4$,

$$2 < \frac{2^{W-1}}{(2^W + 3)/5} < 2.5,$$

so

$$n_c = 2 \cdot \frac{2^W + 3}{5} - 1.$$

This exceeds $2^W/3$, so $d = (2^W + 3)/5$ does not have an optimal program. Because for W < 4 there are no factors of $2^W + 3$, we conclude that no factors of $2^W + 3$ have optimal programs.

In summary, all the factors of $2^W + 1$ and of $2^W + 2$, except for 2 and $(2^W + 2)/2$, have optimal programs, and no other numbers do. Furthermore, the above proof shows that algorithm magic (Figure 10-1 on page 174) always produces the optimal program when it exists.


Let us consider the specific cases W = 16, 32, and 64. The relevant factorizations are shown below.

$$2^{16} + 1 = 65537 \text{ (prime)}$$
$$2^{16} + 2 = 2 \cdot 3^2 \cdot 11 \cdot 331$$
$$2^{32} + 1 = 641 \cdot 6{,}700{,}417$$
$$2^{32} + 2 = 2 \cdot 3 \cdot 715{,}827{,}883$$
$$2^{64} + 1 = 274{,}177 \cdot 67{,}280{,}421{,}310{,}721$$
$$2^{64} + 2 = 2 \cdot 3^3 \cdot 19 \cdot 43 \cdot 5419 \cdot 77{,}158{,}673{,}929$$

Hence we have the results that for W = 16, there are 20 divisors that have optimal programs. The ones less than 100 are 3, 6, 9, 11, 18, 22, 33, 66, and 99.

For W = 32, there are six such divisors: 3, 6, 641, 6,700,417, 715,827,883, and 1,431,655,766.

For W = 64, there are 126 such divisors. The ones less than 100 are 3, 6, 9, 18, 19, 27, 38, 43, 54, 57, and 86.
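The summary criterion is easy to test mechanically. The following is a small sketch for W = 32, assuming 64-bit arithmetic to hold $2^{32} + 1$ and $2^{32} + 2$; the function name is hypothetical. The excluded cofactor $(2^{32} + 2)/2$ needs no explicit check because it is not representable as an `int32_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if signed division by d (W = 32) has the short program
   with no add or shrsi after the mulhs, per the summary above:
   d divides 2**32 + 1 or 2**32 + 2, excluding d = 2. */
static int has_optimal_signed_program(int32_t d) {
    const uint64_t a = ((uint64_t)1 << 32) + 1;
    const uint64_t b = ((uint64_t)1 << 32) + 2;
    if (d < 3) return 0;          /* excludes 1, 2, and d <= 0 */
    uint64_t ud = (uint64_t)d;
    return a % ud == 0 || b % ud == 0;
}

/* Small self-check against two divisors discussed in the text. */
static void check_optimal(void) {
    assert(has_optimal_signed_program(3));
    assert(!has_optimal_signed_program(7));
}
```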


10-8 Unsigned Division 


Unsigned division by a power of 2 is of course implemented by a single shift right logical instruction, and 
remainder by and immediate. 


It might seem that handling other divisors will be simple: Just use the results for signed division with d > 0, 


omitting the two instructions that add 1 if the quotient is negative. We will see, however, that some of the 
details are actually more complicated in the case of unsigned division. 


Unsigned Divide by 3 
For a non-power of 2, let us first consider unsigned division by 3 on a 32-bit machine. Because the dividend n can now be as large as $2^{32} - 1$, the multiplier $(2^{32} + 2)/3$ is inadequate, because the error term $2n/(3 \cdot 2^{32})$ (see "divide by 3" example above) can exceed 1/3. However, the multiplier $(2^{33} + 1)/3$ is adequate. The code is

   li    M,0xAAAAAAAB   Load magic number, (2**33+1)/3.
   mulhu q,M,n          q = floor(M*n/2**32).
   shri  q,q,1

   muli  t,q,3          Compute remainder from
   sub   r,n,t          r = n - q*3.


An instruction that gives the high-order 32 bits of a 64-bit unsigned product is required, which we show above 
as mulhu. 


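To see that the code is correct, observe that it computes

$$q = \left\lfloor \frac{2^{33} + 1}{3} \cdot \frac{n}{2^{33}} \right\rfloor = \left\lfloor \frac{n}{3} + \frac{n}{3 \cdot 2^{33}} \right\rfloor.$$

For $0 \le n < 2^{32}$, $0 \le n/(3 \cdot 2^{33}) < 1/3$, so by Theorem D4, $q = \lfloor n/3 \rfloor$.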


In computing the remainder, the multiply immediate can overflow if we regard the operands as signed integers, 
but it does not overflow if we regard them and the result as unsigned. Also, the subtract cannot overflow, 
because the result is in the range 0 to 2, so the remainder is correct. 
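As a rough C model of the sequence above, the mulhu instruction can be emulated with a 64-bit product (an assumption about the host; the function names are illustrative).

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product (models mulhu). */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned n/3: multiply by (2**33 + 1)/3, then shift right 1. */
static uint32_t divu3(uint32_t n) {
    return mulhu(0xAAAAAAABu, n) >> 1;
}

/* Unsigned remainder: r = n - q*3, computed modulo 2**32. */
static uint32_t remu3(uint32_t n) {
    return n - divu3(n) * 3u;
}

/* Small self-check of the two routines above. */
static void check_divu3(void) {
    assert(divu3(10u) == 3u && remu3(10u) == 1u);
}
```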


Unsigned Divide by 7 


For unsigned division by 7 on a 32-bit machine, the multipliers $(2^{32} + 3)/7$, $(2^{33} + 6)/7$, and $(2^{34} + 5)/7$ are all inadequate because they give too large an error term. The multiplier $(2^{35} + 3)/7$ is acceptable, but it's too large to represent in a 32-bit unsigned word. We can multiply by this large number by multiplying by $(2^{35} + 3)/7 - 2^{32}$ and then correcting the product by inserting an add. The code is

   li    M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
   mulhu q,M,n          q = floor(M*n/2**32).
   add   q,q,n          Can overflow (sets carry).
   shrxi q,q,3          Shift right with carry bit.
   muli  t,q,7          Compute remainder from
   sub   r,n,t          r = n - q*7.


Here we have a problem: The add can overflow. To allow for this, we have invented the new instruction shift right extended immediate (shrxi), which treats the carry from the add and the 32 bits of register q as a single 33-bit quantity, and shifts it right with 0-fill. On the Motorola 68000 family, this can be done with two instructions: rotate with extend right one position, followed by a logical right shift of two, for a total shift of three (roxr actually uses the X bit, but the add sets the X bit the same as the carry bit). On most machines, it will take more. For example, on PowerPC it takes three instructions: clear rightmost three bits of q, add carry to q, and rotate right three positions.

With shrxi implemented somehow, the code above computes

$$q = \left\lfloor \frac{2^{35} + 3}{7} \cdot \frac{n}{2^{35}} \right\rfloor = \left\lfloor \frac{n}{7} + \frac{3n}{7 \cdot 2^{35}} \right\rfloor.$$

For $0 \le n < 2^{32}$, $0 \le 3n/(7 \cdot 2^{35}) < 1/7$, so by Theorem D4, $q = \lfloor n/7 \rfloor$.


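Granlund and Montgomery [GM] have a clever scheme for avoiding the shrxi instruction. It requires the same number of instructions as the above three-instruction sequence for shrxi, but it employs only elementary instructions that almost any machine would have, and it does not cause overflow at all. It uses the identity

$$\left\lfloor \frac{q + n}{2^p} \right\rfloor = \left\lfloor \frac{\lfloor (n - q)/2 \rfloor + q}{2^{p-1}} \right\rfloor, \quad p \ge 1.$$

Applying this to our problem, with $q = \lfloor Mn/2^{32} \rfloor$ where $0 \le M < 2^{32}$, the subtraction will not overflow because

$$n - q = n - \left\lfloor \frac{Mn}{2^{32}} \right\rfloor = \left\lceil n - \frac{Mn}{2^{32}} \right\rceil = \left\lceil n\left(1 - \frac{M}{2^{32}}\right) \right\rceil,$$

so that clearly $0 \le n - q < 2^{32}$. Also, the addition will not overflow, because

$$\left\lfloor \frac{n - q}{2} \right\rfloor + q = \left\lfloor \frac{n - q}{2} + q \right\rfloor = \left\lfloor \frac{n + q}{2} \right\rfloor,$$

and $0 \le n, q < 2^{32}$.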


Using this idea gives the following code for unsigned division by 7: 


   li    M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
   mulhu q,M,n          q = floor(M*n/2**32).
   sub   t,n,q          t = n - q.
   shri  t,t,1          t = (n - q)/2.
   add   t,t,q          t = (n - q)/2 + q = (n + q)/2.
   shri  q,t,2          q = (n+Mn/2**32)/8 = floor(n/7).
   muli  t,q,7          Compute remainder from
   sub   r,n,t          r = n - q*7.

For this to work, the shift amount for the hypothetical shrxi instruction must be greater than 0. It can be shown that if d > 1 and the multiplier $m \ge 2^{32}$ (so that the shrxi instruction is needed), then the shift amount is greater than 0.
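The overflow-free sequence translates almost line for line into C. This is a sketch only, with mulhu emulated by a 64-bit product and illustrative function names.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product (models mulhu). */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned n/7 without a 33-bit intermediate. */
static uint32_t divu7(uint32_t n) {
    uint32_t q = mulhu(0x24924925u, n);   /* q = floor(M*n/2**32) */
    uint32_t t = n - q;                   /* cannot underflow: q <= n */
    t = t >> 1;                           /* t = (n - q)/2 */
    t = t + q;                            /* t = (n + q)/2, no overflow */
    return t >> 2;                        /* total shift of 3 */
}

/* Small self-check of the routine above. */
static void check_divu7(void) {
    assert(divu7(7u) == 1u && divu7(6u) == 0u);
}
```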


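10-9 Unsigned Division by Divisors ≥ 1

Given a word size $W \ge 1$ and a divisor d, $1 \le d < 2^W$, we wish to find the least integer m and integer p such that

Equation 22

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } 0 \le n < 2^W,$$

with $0 \le m < 2^{W+1}$ and $p \ge W$.

In the unsigned case, the magic number M is given by

$$M = \begin{cases} m, & \text{if } 0 \le m < 2^W, \\ m - 2^W, & \text{if } 2^W \le m < 2^{W+1}. \end{cases}$$

Because (22) must hold for n = d, $\lfloor md/2^p \rfloor = 1$, or

Equation 23

$$\frac{md}{2^p} \ge 1.$$

As in the signed case, let $n_c$ be the largest value of n such that $\mathrm{rem}(n_c, d) = d - 1$. It can be calculated from $n_c = \lfloor 2^W/d \rfloor d - 1 = 2^W - \mathrm{rem}(2^W, d) - 1$. Then

Equation 24a

$$2^W - d \le n_c \le 2^W - 1,$$

and

Equation 24b

$$n_c \ge d - 1.$$

These imply that $n_c \ge 2^{W-1}$.

Because (22) must hold for $n = n_c$,

$$\left\lfloor \frac{m n_c}{2^p} \right\rfloor = \left\lfloor \frac{n_c}{d} \right\rfloor = \frac{n_c - (d - 1)}{d},$$

or

$$\frac{m n_c}{2^p} < \frac{n_c + 1}{d}.$$

Combining this with (23) gives

Equation 25

$$\frac{2^p}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because m is to be the least integer satisfying (25), it is the next integer greater than or equal to $2^p/d$; that is,

Equation 26

$$m = \frac{2^p + d - 1 - \mathrm{rem}(2^p - 1, d)}{d}.$$

Combining this with the right half of (25) and simplifying gives

Equation 27

$$2^p > n_c\,(d - 1 - \mathrm{rem}(2^p - 1, d)).$$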


The Algorithm (Unsigned) 


Thus, the algorithm is to find by trial and error the least $p \ge W$ satisfying (27). Then m is calculated from (26).

This is the smallest possible value of m satisfying (22) with $p \ge W$. As in the signed case, if (27) is true for some value of p, then it is true for all larger values of p. The proof is essentially the same as that of Theorem DC1, except Theorem D5(b) is used instead of Theorem D5(a).


Proof That the Algorithm Is Feasible (Unsigned) 


We must show that (27) always has a solution and that $0 \le m < 2^{W+1}$.

Because for any nonnegative integer x there is a power of 2 greater than x and less than or equal to 2x + 1, from (27),

$$n_c\,(d - 1 - \mathrm{rem}(2^p - 1, d)) < 2^p \le 2n_c\,(d - 1 - \mathrm{rem}(2^p - 1, d)) + 1.$$

Because $0 \le \mathrm{rem}(2^p - 1, d) \le d - 1$,

Equation 28

$$1 \le 2^p \le 2n_c\,(d - 1) + 1.$$

Because $n_c, d \le 2^W - 1$, this becomes

$$1 \le 2^p \le 2(2^W - 1)(2^W - 2) + 1,$$

or

Equation 29

$$0 \le p \le 2W.$$

Thus, (27) always has a solution.

If p is not forced to equal W, then from (25) and (28),

$$\frac{1}{d} \le m < \frac{2n_c\,(d - 1) + 1}{d} \cdot \frac{n_c + 1}{n_c},$$

$$1 \le m < \frac{2d - 2 + 1/n_c}{d}\,(n_c + 1),$$

$$1 \le m < 2(n_c + 1) \le 2^{W+1}.$$

If p is forced to equal W, then from (25),

$$\frac{2^W}{d} \le m < \frac{2^W}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $1 \le d \le 2^W - 1$ and $n_c \ge 2^{W-1}$,

$$\frac{2^W}{2^W - 1} \le m < \frac{2^W}{1} \cdot \frac{2^{W-1} + 1}{2^{W-1}},$$

$$2 \le m \le 2^W + 1.$$

Hence in either case m is within limits for the code schema illustrated by the "unsigned divide by 7" example.


Proof That the Product Is Correct (Unsigned) 
We must show that if p and m are calculated from (27) and (26), then (22) is satisfied. 


Equation (26) and inequality (27) are easily seen to imply (25). Inequality (25) is nearly the same as (4), and the remainder of the proof is nearly identical to that for signed division with $n \ge 0$.


10-10 Incorporation into a Compiler (Unsigned) 


There is a difficulty in implementing an algorithm based on direct evaluation of the expressions used in this proof. Although $p \le 2W$, which is proved above, the case p = 2W can occur (e.g., for $d = 2^W - 2$ with $W \ge 4$). When p = 2W, it is difficult to calculate m, because the dividend in (26) does not fit in a 2W-bit word.

However, it can be implemented by the "incremental division and remainder" technique of algorithm magic. The algorithm is given in Figure 10-2 for W = 32. It passes back an indicator a, which tells whether or not to generate an add instruction. (In the case of signed division, the caller recognizes this by M and d having opposite signs.)


Figure 10-2 Computing the magic number for unsigned division. 


struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned nc, delta, q1, r1, q2, r2;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   nc = -1 - (-d)%d;       // Unsigned arithmetic here.
   p = 31;                 // Init. p.
   q1 = 0x80000000/nc;     // Init. q1 = 2**p/nc.
   r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
   q2 = 0x7FFFFFFF/d;      // Init. q2 = (2**p - 1)/d.
   r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (r1 >= nc - r1) {
         q1 = 2*q1 + 1;            // Update q1.
         r1 = 2*r1 - nc;}          // Update r1.
      else {
         q1 = 2*q1;
         r1 = 2*r1;}
      if (r2 + 1 >= d - r2) {
         if (q2 >= 0x7FFFFFFF) magu.a = 1;
         q2 = 2*q2 + 1;            // Update q2.
         r2 = 2*r2 + 1 - d;}       // Update r2.
      else {
         if (q2 >= 0x80000000) magu.a = 1;
         q2 = 2*q2;
         r2 = 2*r2 + 1;}
      delta = d - 1 - r2;
   } while (p < 64 &&
           (q1 < delta || (q1 == delta && r1 == 0)));

   magu.M = q2 + 1;        // Magic number
   magu.s = p - 32;        // and shift amount to return
   return magu;            // (magu.a was set above).
}


Some key points in understanding this algorithm are as follows:

• Unsigned overflow can occur at several places and should be ignored.

• $n_c = 2^W - \mathrm{rem}(2^W, d) - 1 = (2^W - 1) - \mathrm{rem}(2^W - d, d)$.

• The quotient and remainder of dividing $2^p$ by $n_c$ cannot be updated in the same way as is done in algorithm magic, because here the quantity 2*r1 can overflow. Hence the algorithm has the test "if (r1 >= nc - r1)," whereas "if (2*r1 >= nc)" would be more natural. A similar remark applies to computing the quotient and remainder of $2^p - 1$ divided by d.

• $0 \le \delta \le d - 1$, so δ is representable as a 32-bit unsigned integer.

• $m = (2^p + d - 1 - \mathrm{rem}(2^p - 1, d))/d = \lfloor (2^p - 1)/d \rfloor + 1 = q_2 + 1$.

• The subtraction of $2^W$ when the magic number M exceeds $2^W - 1$ is not explicit in the program; it occurs if the computation of q2 overflows.

• The "add" indicator, magu.a, cannot be set by a straightforward comparison of M to $2^{32}$, or of q2 to $2^{32} - 1$, because of overflow. Instead, the program tests q2 before overflow can occur. If q2 ever gets as large as $2^{32} - 1$, so that M will be greater than or equal to $2^{32}$, then magu.a is set equal to 1. If q2 stays below $2^{32} - 1$, then magu.a is left at its initial value of 0.

• Inequality (27) is equivalent to $2^p/n_c > \delta$.

• The loop test needs the condition "p < 64" because without it, overflow of q1 would cause the program to loop too many times, giving incorrect results.
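For illustration, here is one way a caller might consume the three results of the unsigned algorithm. This is a sketch, assuming the overflow-free sequence from the "unsigned divide by 7" example is used whenever the "add" indicator is set; `divu_magic` is a hypothetical name, and the (M, a, s) triples in the usage note are the W = 32 values for d = 3, 7, and 10.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product (models mulhu). */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Models the generated code for unsigned n/d, given the magic
   number M, "add" indicator a, and shift amount s for d. */
static uint32_t divu_magic(uint32_t n, uint32_t M, int a, int s) {
    uint32_t q = mulhu(M, n);
    if (a) {
        assert(s >= 1);                   /* holds whenever d > 1 */
        uint32_t t = ((n - q) >> 1) + q;  /* (n + q)/2 without overflow */
        return t >> (s - 1);
    }
    return q >> s;
}
```

For example, `divu_magic(n, 0xAAAAAAAB, 0, 1)` models n/3, `divu_magic(n, 0x24924925, 1, 3)` models n/7, and `divu_magic(n, 0xCCCCCCCD, 0, 3)` models n/10.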


10-11 Miscellaneous Topics (Unsigned) 


Theorem DC2U. The least multiplier m is odd if p is not forced to equal W. 


Theorem DC3U. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not 
forced to equal W. 


The proofs of these theorems follow very closely the corresponding proofs for signed division. 


The Divisors with the Best Programs (Unsigned) 


For unsigned division, to find the divisors (if any) with optimal programs of two instructions to obtain the quotient (li, mulhu), we can do an analysis similar to that of the signed case (see "The Divisors with the Best Programs" on page 175). The result is that such divisors are the factors of $2^W$ or $2^W + 1$, except for d = 1. For the common word sizes, this leaves very few nontrivial divisors that have optimal programs for unsigned division. For W = 16, there are none. For W = 32, there are only two: 641 and 6,700,417. For W = 64, again there are only two: 274,177 and 67,280,421,310,721.

The case $d = 2^k$, k = 1, 2, ..., deserves special mention. In this case, algorithm magicu produces p = W (forced), $m = 2^{32-k}$. This is the minimal value of m, but it is not the minimal value of M. Better code results if p = W + k is used, if sufficient simplifications are done. Then, $m = 2^W$, M = 0, a = 1, and s = k. The generated code involves a multiplication by 0 and can be simplified to a single shift right k instruction. As a practical matter, divisors that are a power of 2 would probably be special-cased without using magicu. (This phenomenon does not occur for signed division, because for signed division m cannot be a power of 2. Proof: For d > 0, inequality (4) combined with (3b) implies that $d - 1 < 2^p/m < d$. Therefore, $2^p/m$ cannot be an integer. For d < 0, the result follows similarly from (16) combined with (15b).)

For unsigned division, the code for the case $m \ge 2^W$ is considerably worse than the code for the case $m < 2^W$, if the machine does not have shrxi. Hence it is of interest to have some idea of how often the large multipliers arise. For W = 32, among the integers less than 100, there are 31 "bad" divisors: 1, 7, 14, 19, 21, 27, 28, 31, 35, 37, 38, 39, 42, 45, 53, 54, 55, 56, 57, 62, 63, 70, 73, 74, 76, 78, 84, 90, 91, 95, and 97.


Using Signed in Place of Unsigned Multiply, and the Reverse 
If your machine does not have mulhu, but it does have mulhs (or signed long multiplication), the trick given in "High-Order Product Signed from/to Unsigned," on page 132, might make our method of doing unsigned division by a constant still useful.

That section gives a seven-instruction sequence for getting mulhu from mulhs. However, for this application it simplifies, because the magic number M is known. Thus, the compiler can test the most significant bit of the magic number, and generate code such as the following for the operation "mulhu q,M,n." Here t denotes a temporary register.

   M31 = 0               M31 = 1
   mulhs q,M,n           mulhs q,M,n
   shrsi t,n,31          shrsi t,n,31
   and   t,t,M           and   t,t,M
   add   q,q,t           add   t,t,n
                         add   q,q,t

Accounting for the other instructions used with mulhu, this uses a total of six to eight instructions to obtain the quotient of unsigned division by a constant, on a machine that does not have unsigned multiply.

This trick may be inverted, to get mulhs in terms of mulhu. The code is the same as that above except the mulhs is changed to mulhu and the final add in each column is changed to sub.
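The two columns can be modeled together in C, with the test of the high bit of M standing in for the compiler's decision. This sketch assumes arithmetic right shift of negative signed values and two's-complement conversion between `uint32_t` and `int32_t`; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the signed product (models mulhs). */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Reference: high-order 32 bits of the unsigned product. */
static uint32_t mulhu_ref(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* mulhu q,M,n built from mulhs, following the two code columns. */
static uint32_t mulhu_from_mulhs(uint32_t M, uint32_t n) {
    uint32_t q = (uint32_t)mulhs((int32_t)M, (int32_t)n);
    uint32_t t = (uint32_t)((int32_t)n >> 31);  /* shrsi t,n,31 */
    t = t & M;                                  /* and   t,t,M  */
    if (M >> 31) t = t + n;                     /* add t,t,n when M31 = 1 */
    return q + t;                               /* add   q,q,t  */
}

/* Small self-check against the reference. */
static void check_mulhu(void) {
    assert(mulhu_from_mulhs(0xAAAAAAABu, 5u) == mulhu_ref(0xAAAAAAABu, 5u));
}
```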


A Simpler Algorithm (Unsigned) 


Dropping the requirement that the magic number be minimal yields a simpler algorithm. In place of (27) we can use

Equation 30

$$2^p \ge 2^W\,(d - 1 - \mathrm{rem}(2^p - 1, d)),$$

and then use (26) to compute m, as before.

It should be clear that this algorithm is formally correct (that is, that the value of m computed does satisfy equation (22)), because its only difference from the previous algorithm is that it computes a value of p that, for some values of d, is unnecessarily large. It can be proved that the value of m computed from (30) and (26) is less than $2^{W+1}$. We omit the proof and simply give the algorithm (Figure 10-3).


Figure 10-3 Simplified algorithm for computing the magic number, unsigned division. 

struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu2(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned p32, q, r, delta;
   struct mu magu;
   magu.a = 0;             // Initialize "add" indicator.
   p = 31;                 // Initialize p.
   q = 0x7FFFFFFF/d;       // Initialize q = (2**p - 1)/d.
   r = 0x7FFFFFFF - q*d;   // Init. r = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (p == 32) p32 = 1;     // Set p32 = 2**(p-32).
      else p32 = 2*p32;
      if (r + 1 >= d - r) {
         if (q >= 0x7FFFFFFF) magu.a = 1;
         q = 2*q + 1;           // Update q.
         r = 2*r + 1 - d;       // Update r.
      }
      else {
         if (q >= 0x80000000) magu.a = 1;
         q = 2*q;
         r = 2*r + 1;
      }
      delta = d - 1 - r;
   } while (p < 64 && p32 < delta);
   magu.M = q + 1;         // Magic number and
   magu.s = p - 32;        // shift amount to return
   return magu;            // (magu.a was set above).
}


Alverson [Alv] gives a much simpler algorithm, discussed in the next section, but it gives somewhat large values for m. The point of algorithm magicu2 is that it nearly always gives the minimal value for m when $d \le 2^{W-1}$. For W = 32, the smallest divisor for which magicu2 does not give the minimal multiplier is d = 102,807, for which magicu calculates m = 2,737,896,999 and magicu2 calculates m = 5,475,793,997.


There is an analog of magicu2 for signed division by positive divisors, but it does not work out very well for 
signed division by arbitrary divisors. 


10-12 Applicability to Modulus and Floor Division 


It might seem that turning modulus or floor division by a constant into multiplication would be simpler, in that 
the "add 1 if the dividend is negative" step could be omitted. This is not the case. The methods given above do 
not apply in any obvious way to modulus and floor division. Perhaps something could be worked out; it might 
involve altering the multiplier m slightly depending upon the sign of the dividend. 


10-13 Similar Methods 


Rather than coding algorithm magic, we can provide a table that gives the magic numbers and shift amounts for 
a few small divisors. Divisors equal to the tabulated ones multiplied by a power of 2 are easily handled as 
follows: 


1. Count the number of trailing 0's in d, and let this be denoted by k. 
2. Use as the lookup argument $d/2^k$ (shift right k).
3. Use the magic number found in the table. 
4. Use the shift amount found in the table, increased by k. 
Thus, if the table contains the divisors 3, 5, 25, and so on, divisors of 6, 10, 100, and so forth can be handled. 
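The four steps can be sketched as a small lookup routine. The table below is a hypothetical excerpt holding the signed W = 32 entries for a few odd divisors; the struct and function names are illustrative, not part of the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ms_entry { int32_t d; uint32_t M; int s; };  /* hypothetical row */

/* Signed magic numbers and shift amounts, W = 32, odd divisors only. */
static const struct ms_entry odd_table[] = {
    {3, 0x55555556u, 0}, {5, 0x66666667u, 1}, {7, 0x92492493u, 2},
    {9, 0x38E38E39u, 1}, {25, 0x51EB851Fu, 3},
};

/* Looks up d = (tabulated divisor) * 2**k. Returns 1 and fills *M and
   *s on success, or 0 if the odd part of d is not in the table. */
static int lookup_magic(int32_t d, uint32_t *M, int *s) {
    assert(M != NULL && s != NULL);
    if (d <= 0) return 0;
    int k = 0;
    while ((d & 1) == 0) {            /* step 1: count trailing 0's */
        d = d >> 1;                   /* step 2: form d/2**k */
        k = k + 1;
    }
    for (size_t i = 0; i < sizeof odd_table / sizeof odd_table[0]; i++) {
        if (odd_table[i].d == d) {
            *M = odd_table[i].M;      /* step 3: tabulated magic number */
            *s = odd_table[i].s + k;  /* step 4: shift increased by k */
            return 1;
        }
    }
    return 0;
}
```

For d = 100 this finds the entry for 25 with k = 2 and returns M = 0x51EB851F with a shift amount of 5.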


This procedure usually gives the smallest magic number, but not always. The smallest positive divisor for 
which it fails in this respect for W = 32 is d = 334,972, for which it computes m = 3,361,176,179 and s = 18. 
However, the minimal magic number for d = 334,972 is m = 840,294,045, with s = 16. The procedure also fails 
to give the minimal magic number for d = -6. In both these cases, output code quality is affected. 


Alverson [Alv] is the first known to the author to state that the method described here works with complete accuracy for all divisors. Using our notation, his method for unsigned integer division by d is to set the shift amount $p = W + \lceil \log_2 d \rceil$, and the multiplier $m = \lceil 2^p/d \rceil$, and then do the division by $n \div d = \lfloor mn/2^p \rfloor$ (that is, multiply and shift right). He proves that the multiplier m is less than $2^{W+1}$, and that the method gets the exact quotient for all n expressible in W bits.

Alverson's method is a simpler variation of ours in that it doesn't require trial and error to determine p, and is thus more suitable for building in hardware, which is his primary interest. His multiplier m, however, is always greater than or equal to $2^W$, and thus for the software application always gives the code illustrated by the "divide by 7" example (that is, always has the add and shrxi, or the alternative four instructions). Because most small divisors can be handled with a multiplier less than $2^W$, it seems worthwhile to look for these cases.
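Alverson's choice of p and m is simple enough to check exhaustively for a small word size. The sketch below uses W = 16 (an assumption made here so that every intermediate fits comfortably in 64 bits); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest k with 2**k >= d, i.e. ceil(log2(d)), for d >= 1. */
static int ceil_log2(uint32_t d) {
    int k = 0;
    while (((uint32_t)1 << k) < d) k = k + 1;
    return k;
}

/* Unsigned n/d for W = 16 using p = W + ceil(log2 d), m = ceil(2**p/d). */
static uint32_t alverson_div16(uint32_t n, uint32_t d) {
    assert(d >= 1 && d <= 65535 && n <= 65535);
    int p = 16 + ceil_log2(d);                /* shift amount */
    uint64_t pow2 = (uint64_t)1 << p;
    uint64_t m = (pow2 + d - 1) / d;          /* ceil(2**p/d) */
    return (uint32_t)((m * n) >> p);          /* multiply and shift right */
}

/* Returns 1 if the method gives n/d for every 16-bit n. */
static int alverson_check16(uint32_t d) {
    for (uint32_t n = 0; n < 65536; n++)
        if (alverson_div16(n, d) != n / d) return 0;
    return 1;
}
```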


For signed division, Alverson suggests finding the multiplier for |d| and a word length of W - 1 (then $2^{W-1} \le m < 2^W$), multiplying the dividend by it, and negating the result if the operands have opposite signs. (The multiplier must be such that it gives the correct result when the dividend is $2^{W-1}$, the absolute value of the maximum negative number.) It seems possible that this suggestion might give better code than what has been given here in the case that the multiplier $m \ge 2^{W-1}$. Applying it to signed division by 7 gives the following code, where we have used the relation $-x = \bar{x} + 1$ to avoid a branch:


   abs   an,n
   li    M,0x92492493   Magic number, (2**34+5)/7.
   mulhu q,M,an         q = floor(M*an/2**32).
   shri  q,q,2
   shrsi t,n,31         These three instructions
   xor   q,q,t          negate q if n is
   sub   q,q,t          negative.

This is not quite as good as the code we gave for signed division by 7 (six vs. seven instructions), but it would be useful on a machine that has abs and mulhu but not mulhs.


10-14 Sample Magic Numbers 


Table 10-1. Some Magic Numbers for W = 32

               Signed                 Unsigned
    d      M (hex)     s         M (hex)     a   s
   -5      99999999    1
   -3      55555555    1
  -2^k     7FFFFFFF   k-1
    1         -        -            0        1   0
   2^k     80000001   k-1       2^(32-k)     0   0
    3      55555556    0        AAAAAAAB     0   1
    5      66666667    1        CCCCCCCD     0   2
    6      2AAAAAAB    0        AAAAAAAB     0   2
    7      92492493    2        24924925     1   3
    9      38E38E39    1        38E38E39     0   1
   10      66666667    2        CCCCCCCD     0   3
   11      2E8BA2E9    1        BA2E8BA3     0   3
   12      2AAAAAAB    1        AAAAAAAB     0   3
   25      51EB851F    3        51EB851F     0   3
  125      10624DD3    3        10624DD3     0   3
  625      68DB8BAD    8        D1B71759     0   9


Table 10-2. Some Magic Numbers for W = 64

                   Signed                         Unsigned
    d      M (hex)             s         M (hex)             a   s
   -5      9999999999999999    1
   -3      5555555555555555    1
  -2^k     7FFFFFFFFFFFFFFF   k-1
    1             -            -                0            1   0
   2^k     8000000000000001   k-1           2^(64-k)         0   0
    3      5555555555555556    0        AAAAAAAAAAAAAAAB     0   1
    5      6666666666666667    1        CCCCCCCCCCCCCCCD     0   2
    6      2AAAAAAAAAAAAAAB    0        AAAAAAAAAAAAAAAB     0   2
    7      4924924924924925    1        2492492492492493     1   3
    9      1C71C71C71C71C72    0        E38E38E38E38E38F     0   3
   10      6666666666666667    2        CCCCCCCCCCCCCCCD     0   3
   11      2E8BA2E8BA2E8BA3    1        2E8BA2E8BA2E8BA3     0   1
   12      2AAAAAAAAAAAAAAB    1        AAAAAAAAAAAAAAAB     0   3
   25      A3D70A3D70A3D70B    4        47AE147AE147AE15     1   5


10-15 Exact Division by Constants 


By "exact division," we mean division in which it is known beforehand, somehow, that the remainder is 0. 
Although this situation is not common, it does arise, for example, when subtracting two pointers in the C 
language. In C, the result of p - q, where p and q are pointers, is well defined and portable only if p and q point 
to objects in the same array [H&S, sec. 7.6.2]. If the array element size is s, the object code for the difference p - q computes (p - q)/s.


The material in this section was motivated by [GM, sec. 9]. 


The method to be given applies to both signed and unsigned exact division, and is based on the following 
theorem. 


Theorem MI. If a and m are relatively prime integers, then there exists an integer $\bar{a}$, $1 \le \bar{a} < m$, such that

$$a\bar{a} \equiv 1 \pmod{m}.$$

Thus $\bar{a}$ is a multiplicative inverse of a, modulo m. There are several ways to prove this theorem; three proofs are given in [NZM, 52]. The proof below requires only a very basic familiarity with congruences.


Proof. We will prove something a little more general than the theorem. If a and m are relatively prime (and 
hence nonzero), then as x ranges over all m distinct values modulo m, ax takes on all m distinct values modulo 
m. For example, if a = 3 and m = 8, then as x ranges from 0 to 7, ax = 0, 3, 6, 9, 12, 15, 18, 21 or, reduced 
modulo 8, ax = 0, 3, 6, 1, 4, 7, 2, 5. Observe that all values from 0 to 7 are present in the last sequence. 


To see this in general, assume that it is not true. Then there exist distinct integers that map to the same value
when multiplied by a; that is, there exist x and y, with $x \not\equiv y \pmod{m}$, such that

$$ax \equiv ay \pmod{m}.$$

But then there exists an integer k such that

$$ax - ay = km, \quad \text{or} \quad a(x - y) = km.$$

Because a has no factor in common with m, it must be that x - y is a multiple of m; that is,

$$x \equiv y \pmod{m}.$$

This contradicts the hypothesis.


Now, because ax takes on all m distinct values modulo m, as x ranges over the m values, it must take on the 
value 1 for some x. 


The proof shows that there is only one value (modulo m) of x such that $ax \equiv 1 \pmod{m}$; that is, the
multiplicative inverse is unique, apart from additive multiples of m. It also shows that there is a unique (modulo
m) integer x such that $ax \equiv b \pmod{m}$, where b is any integer.


As an example, consider the case m = 16. Then $\bar{3} = 11$, because $3 \cdot 11 = 33 \equiv 1 \pmod{16}$. We could just as
well take $\bar{3} = -5$, because $3 \cdot (-5) = -15 \equiv 1 \pmod{16}$. Similarly, $\overline{-3} = 5$, because $(-3) \cdot 5 = -15 \equiv 1 \pmod{16}$.


These observations are important because they show that the concepts apply to both signed and unsigned
numbers. If we are working in the domain of unsigned integers on a 4-bit machine, we take $\bar{3} = 11$. In the
domain of signed integers, we take $\bar{3} = -5$. But 11 and -5 have the same representation in two's-complement
(because they differ by 16), so the same computer word contents can serve in both domains as the
multiplicative inverse.


The theorem applies directly to the problem of division (signed and unsigned) by an odd integer d on a W-bit
computer. Because any odd integer is relatively prime to $2^W$, the theorem says that if d is odd, there exists an
integer $\bar{d}$ (unique in the range 0 to $2^W - 1$ or in the range $-2^{W-1}$ to $2^{W-1} - 1$) such that

$$d\bar{d} \equiv 1 \pmod{2^W}.$$

Hence for any integer n that is a multiple of d,

$$\frac{n}{d} \equiv \frac{n}{d}(d\bar{d}) \equiv n\bar{d} \pmod{2^W}.$$

In other words, n/d can be calculated by multiplying n by $\bar{d}$, and retaining only the rightmost W bits of the
product.


If the divisor d is even, let $d = d_o \cdot 2^k$, where $d_o$ is odd and $k \ge 1$. Then, simply shift n right k positions (shifting
out 0's), and then multiply by $\bar{d_o}$ (the shift could be done after the multiplication as well).


Below is the code for division of n by 7, where n is a multiple of 7. This code gives the correct result whether it 
is considered to be signed or unsigned division. 


   li    M,0xB6DB6DB7     Mult. inverse, (5*2**32 + 1)/7.
   mul   q,M,n            q = n/7.
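
The same two-instruction sequence can be sketched in C, relying on the fact that unsigned multiplication in C is performed modulo 2^32 for a 32-bit type. The function names are hypothetical. The signed variant assumes the usual two's-complement conversion from uint32_t back to int32_t.

```c
#include <stdint.h>

/* Exact unsigned division by 7; the result is meaningful only
   when n is a multiple of 7. */
uint32_t exact_div7_u(uint32_t n) {
    return n * 0xB6DB6DB7u;          /* low 32 bits of n * inverse(7) */
}

/* Exact signed division by 7, using the same word as the inverse.
   Assumes two's-complement conversion of the unsigned result. */
int32_t exact_div7_s(int32_t n) {
    return (int32_t)((uint32_t)n * 0xB6DB6DB7u);
}
```

For example, exact_div7_u(49) yields 7 and exact_div7_s(-21) yields -3; for an argument that is not a multiple of 7 the result is not a quotient at all.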


Computing the Multiplicative Inverse by the Euclidean Algorithm 


How can we compute the multiplicative inverse? The standard method is by means of the "extended Euclidean 
algorithm." This is briefly discussed below as it applies to our problem, and the interested reader is referred to 
[NZM, 13] and to [Knu2, sec. 4.5.2] for a more complete discussion. 


Given an odd divisor d, we wish to solve for x

$$dx \equiv 1 \pmod{m},$$

where, in our application, $m = 2^W$ and W is the word size of the machine. This will be accomplished if we can
solve for integers x and y (positive, negative, or 0) the equation

$$dx + my = 1.$$

Toward this end, first make d positive by adding a sufficient number of multiples of m to it. (d and d + km have
the same multiplicative inverse.) Second, write the following equations (in which d, m > 0):

$$d(-1) + m(1) = m - d \qquad \text{(i)}$$
$$d(1) + m(0) = d. \qquad \text{(ii)}$$

If d = 1, we are done, because (ii) shows that x = 1. Otherwise, compute

$$q = \left\lfloor \frac{m - d}{d} \right\rfloor.$$

Third, multiply Equation (ii) by q and subtract it from (i). This gives

$$d(-1 - q) + m(1) = m - d - qd = \mathrm{rem}(m - d, d).$$

This equation holds because we have simply multiplied one equation by a constant and subtracted it from
another. If rem(m - d, d) = 1, we are done; this last equation is the solution and x = -1 - q.


Repeat this process on the last two equations, obtaining a fourth, and continue until the right-hand side of the
equation is 1. The multiplier of d, reduced modulo m, is then the desired inverse of d.


Incidentally, if m - d < d, so that the first quotient is 0, then the third row will be a copy of the first, so that the
second quotient will be nonzero. Furthermore, most texts start with the first row being

$$d(0) + m(1) = m,$$

but in our application $m = 2^W$ is not representable in the machine.


The process is best illustrated by an example. Let m = 256 and d = 7. Then the calculation proceeds as follows.
To get the third row, note that $q = \lfloor 249/7 \rfloor = 35$.


   7( -1) + 256( 1) = 249
   7(  1) + 256( 0) =   7
   7(-36) + 256( 1) =   4
   7( 37) + 256(-1) =   3
   7(-73) + 256( 2) =   1


Thus, the multiplicative inverse of 7, modulo 256, is -73 or, expressed in the range 0 to 255, is 183. Check:
$7 \cdot 183 = 1281 \equiv 1 \pmod{256}$.


From the third row on, the integers in the right-hand column are all remainders with respect to the number
above it as a divisor (d being the dividend), so they form a sequence of strictly decreasing nonnegative integers.
Therefore, the sequence must end in 0 (as the above would if carried one more step). Furthermore, the value
just before the 0 must be 1, for the following reason. Suppose the sequence ends in b followed by 0, with
$b \ne 1$. Then, the integer preceding the b must be a multiple of b, let's say $k_1 b$, for the next remainder to be 0. The
integer preceding $k_1 b$ must be of the form $k_1 k_2 b + b$, for the next remainder to be b. Continuing up the
sequence, every number must be a multiple of b, including the first two (in the positions of the 249 and the 7 in
the above example). But this is impossible, because the first two integers are m - d and d, which are relatively
prime.


This constitutes an informal proof that the above process terminates, with a value of 1 in the right-hand column, 
and hence it finds the multiplicative inverse of d. 


To carry this out on a computer, first note that if d < 0 we should add $2^W$ to it. But with two's-complement
arithmetic it is not necessary to actually do anything here; simply interpret d as an unsigned number regardless
of how the application interprets it.


The computation of q must use unsigned division.


Observe that the calculations can be done modulo m, because this does not change the right-hand column (these
values are in the range 0 to m - 1 anyway). This is important, because it enables the calculations to be done in
"single precision," using the computer's modulo-$2^W$ unsigned arithmetic.


Most of the quantities in the table need not be represented. The column of multiples of 256 need not be 
represented, because in solving dx + my = 1, we do not need the value of y. There is no need to represent d in 
the first column. Reduced to its bare essentials, then, the calculation of the above example is carried out as 
follows: 


255 249 
1 7 
220 4 
37 3 
183 1 


A C program for performing this computation is shown in Figure 10-4. 


Figure 10-4 Multiplicative inverse modulo 2^32 by the Euclidean algorithm.


unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned x1, v1, x2, v2, x3, v3, q;

   x1 = 0xFFFFFFFF;     v1 = -d;
   x2 = 1;              v2 = d;
   while (v2 > 1) {
      q = v1/v2;
      x3 = x1 - q*x2;   v3 = v1 - q*v2;
      x1 = x2;          v1 = v2;
      x2 = x3;          v2 = v3;
   }
   return(x2);
}
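
To connect the program with the worked example for m = 256, the following sketch repeats the same computation in 8-bit arithmetic. It is an illustration only; the name mulinv8 and the use of uint8_t are assumptions made here so that the intermediate values match the reduced two-column table (255/249, 1/7, 220/4, 37/3, 183/1).

```c
#include <stdint.h>

/* 8-bit analogue of the Euclidean computation; d must be odd.
   All arithmetic is reduced modulo 256 by the uint8_t casts. */
uint8_t mulinv8(uint8_t d) {
    uint8_t x1 = 0xFF, v1 = (uint8_t)(0u - d);  /* row (i):  d(-1) = m - d */
    uint8_t x2 = 1,    v2 = d;                  /* row (ii): d(1)  = d     */
    while (v2 > 1) {
        uint8_t q  = v1 / v2;                   /* unsigned division */
        uint8_t x3 = (uint8_t)(x1 - q * x2);
        uint8_t v3 = (uint8_t)(v1 - q * v2);
        x1 = x2;  v1 = v2;
        x2 = x3;  v2 = v3;
    }
    return x2;
}
```

Called with d = 7, the loop passes through the multipliers 220, 37, and 183 and returns 183, in agreement with the table.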


The reason the loop continuation condition is (v2 > 1) rather than the more natural (v2 != 1) is that if the
latter condition were used, the loop would never terminate if the program were invoked with an even argument.
It is best that programs not loop forever even if misused. (If the argument d is even, v2 never takes on the
value 1, but it does become 0.)


What does the program compute if given an even argument? As written, it computes a number x such that
$dx \equiv 0 \pmod{2^{32}}$, which is probably not useful. However, with the minor modification of changing the loop
continuation condition to (v2 != 0) and returning x1 rather than x2, it computes a number x such that
$dx \equiv g \pmod{2^{32}}$, where g is the greatest common divisor of d and $2^{32}$; that is, the greatest power of 2 that divides
d. The modified program still computes the multiplicative inverse of d for d odd, but it requires one more
iteration than the unmodified program.


As for the number of iterations (divisions) required by the above program, for d odd and less than 20, it requires 
a maximum of 3 and an average of 1.7. For d in the neighborhood of 1000, it requires a maximum of 11 and an 
average of about 6. 


Computing the Multiplicative Inverse by Newton's Method 


It is well known that, over the real numbers, 1/d, for $d \ne 0$, can be calculated to ever increasing accuracy by
iteratively evaluating

Equation 31

$$x_{n+1} = x_n(2 - d x_n),$$

provided the initial estimate $x_0$ is sufficiently close to 1/d. The number of digits of accuracy approximately
doubles with each iteration.


It is not so well known that this same formula can be used to find the multiplicative inverse in the domain of
modular arithmetic on integers! For example, to find the multiplicative inverse of 3, modulo 256, start with
$x_0 = 1$ (any odd number will do). Then,

$$x_1 = 1(2 - 3 \cdot 1) = -1,$$
$$x_2 = -1(2 - 3(-1)) = -5,$$
$$x_3 = -5(2 - 3(-5)) = -85,$$
$$x_4 = -85(2 - 3(-85)) = -21845 \equiv -85 \pmod{256}.$$

The iteration has reached a fixed point modulo 256, so -85, or 171, is the multiplicative inverse of 3 (modulo
256). All calculations can be done modulo 256.


Why does this work? Because if $x_n$ satisfies

$$d x_n \equiv 1 \pmod{m}$$

and if $x_{n+1}$ is defined by (31), then

$$d x_{n+1} \equiv 1 \pmod{m^2}.$$

To see this, let $d x_n = 1 + km$. Then

$$d x_{n+1} = d x_n(2 - d x_n)$$
$$= (1 + km)(2 - (1 + km))$$
$$= (1 + km)(1 - km)$$
$$= 1 - k^2 m^2$$
$$\equiv 1 \pmod{m^2}.$$

In our application, m is a power of 2, say $2^N$. In this case, if

$$d x_n \equiv 1 \pmod{2^N}, \quad \text{then} \quad d x_{n+1} \equiv 1 \pmod{2^{2N}}.$$

In a sense, if $x_n$ is regarded as a sort of approximation to $\bar{d}$, then each iteration of (31) doubles the number of
bits of "accuracy" of the approximation.


It happens that, modulo 8, the multiplicative inverse of any (odd) number d is d itself. Thus, taking $x_0 = d$ is a
reasonable and simple initial guess at $\bar{d}$. Then, (31) will give values of $x_1, x_2, \ldots$, such that

$$d x_1 \equiv 1 \pmod{2^6},$$
$$d x_2 \equiv 1 \pmod{2^{12}},$$
$$d x_3 \equiv 1 \pmod{2^{24}},$$
$$d x_4 \equiv 1 \pmod{2^{48}}, \text{ and so on.}$$

Thus, four iterations suffice to find the multiplicative inverse modulo $2^{32}$ (if $x \equiv 1 \pmod{2^{48}}$ then $x \equiv 1 \pmod{2^n}$
for $n \le 48$). This leads to the C program in Figure 10-5, in which all computations are done modulo $2^{32}$.


Figure 10-5 Multiplicative inverse modulo 2^32 by Newton's method.


unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned xn, t;

   xn = d;
loop: t = d*xn;
   if (t == 1) return xn;
   xn = xn*(2 - t);
   goto loop;
}


For about half the values of d, this program takes 4 1/2 iterations, or nine multiplications. For the other half
(those for which the initial value of xn is "correct to 4 bits"; that is, $d^2 \equiv 1 \pmod{16}$), it takes seven or fewer,
usually seven, multiplications. Thus, it takes about eight multiplications on average.


A variation is to simply execute the loop four times, regardless of d, perhaps "strung out" to eliminate the loop
control (eight multiplications). Another variation is to somehow make the initial estimate $x_0$ "correct to 4
bits" (that is, find $x_0$ that satisfies $d x_0 \equiv 1 \pmod{16}$). Then, only three loop iterations are required. Some ways
to set the initial estimate are

$$x_0 \leftarrow d + 2((d + 1) \mathbin{\&} 4), \text{ and}$$
$$x_0 \leftarrow d^2 + d - 1.$$

Here, the multiplication by 2 is a left shift, and the computations are done modulo $2^{32}$ (ignoring overflow).
Because the second formula uses a multiplication, it saves only one.
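
The variation with a 4-bit-correct initial estimate and three unrolled Newton steps might look as follows in C. This is a sketch under the assumption that uint32_t arithmetic wraps modulo 2^32 (as C guarantees for unsigned types); the name mulinv_newton3 is hypothetical.

```c
#include <stdint.h>

/* Multiplicative inverse of odd d modulo 2^32, using the first
   initial estimate above and three Newton iterations. */
uint32_t mulinv_newton3(uint32_t d) {
    uint32_t x = d + 2*((d + 1) & 4);   /* d*x == 1 (mod 2^4)  */
    x = x * (2 - d * x);                /* d*x == 1 (mod 2^8)  */
    x = x * (2 - d * x);                /* d*x == 1 (mod 2^16) */
    x = x * (2 - d * x);                /* d*x == 1 (mod 2^32) */
    return x;
}
```

For d = 7 this returns 0xB6DB6DB7, the constant used in the exact-division example earlier in the section.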


This concern about execution time is of course totally unimportant for the compiler application. For that 
application, the routine would be so seldom used that it should be coded for minimum space. But there may be 
applications in which it is desirable to compute the multiplicative inverse quickly. 


Sample Multiplicative Inverses 
We conclude this section with a listing of some multiplicative inverses in Table 10-3. 


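Table 10-3. Sample Multiplicative Inverses

     d       mod 16     mod 2^32       mod 2^64
   (dec)     (dec)       (hex)          (hex)
    -7        -7        49249249     9249249249249249
    -5         3        33333333     3333333333333333
    -3         5        55555555     5555555555555555
    -1        -1        FFFFFFFF     FFFFFFFFFFFFFFFF
     1         1               1                    1
     3        11        AAAAAAAB     AAAAAAAAAAAAAAAB
     5        13        CCCCCCCD     CCCCCCCCCCCCCCCD
     7         7        B6DB6DB7     6DB6DB6DB6DB6DB7
     9         9        38E38E39     8E38E38E38E38E39
    11         3        BA2E8BA3     2E8BA2E8BA2E8BA3
    13         5        C4EC4EC5     4EC4EC4EC4EC4EC5
    15        15        EEEEEEEF     EEEEEEEEEEEEEEEF
    25         9        C28F5C29     8F5C28F5C28F5C29
   125         5        26E978D5     1CAC083126E978D5
   625         1        3AFB7E91     D288CE703AFB7E91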

You may notice that in several cases (d = 3, 5, 9, 11), the multiplicative inverse of d is the same as the magic
number for unsigned division by d (see Section 10-14, "Sample Magic Numbers," on page 189). This is more or
less a coincidence. It happens that for these numbers, the magic number M is equal to the multiplier m, and
these are of the form $(2^p + 1)/d$, with $p \ge 32$. In this case, notice that

$$Md = 2^p + 1 \equiv 1 \pmod{2^{32}},$$

so that $M \equiv \bar{d} \pmod{2^{32}}$.


10-16 Test for Zero Remainder after Division by a Constant 


The multiplicative inverse of a divisor d can be used to test for a zero remainder after division by d [GM]. 


Unsigned 


First, consider unsigned division with the divisor d odd. Denote by $\bar{d}$ the multiplicative inverse of d. Then,
because $d\bar{d} \equiv 1 \pmod{2^W}$, where W is the machine's word size in bits, $\bar{d}$ is also odd. Thus, $\bar{d}$ is relatively
prime to $2^W$, and as shown in the proof of theorem MI in the preceding section, as n ranges over all $2^W$ distinct
values modulo $2^W$, $n\bar{d}$ takes on all $2^W$ distinct values modulo $2^W$.


It was shown in the preceding section that if n is a multiple of d,

$$\frac{n}{d} \equiv n\bar{d} \pmod{2^W}.$$

That is, for $n = 0, d, 2d, \ldots, \lfloor (2^W - 1)/d \rfloor d$, we have $n\bar{d} \equiv 0, 1, 2, \ldots, \lfloor (2^W - 1)/d \rfloor \pmod{2^W}$. Therefore, for n
not a multiple of d, the value of $n\bar{d}$, reduced modulo $2^W$ to the range 0 to $2^W - 1$, must exceed $\lfloor (2^W - 1)/d \rfloor$.


This can be used to test for a zero remainder. For example, to test if an integer n is a multiple of 25, multiply n
by $\overline{25}$ and compare the rightmost W bits to $\lfloor (2^W - 1)/25 \rfloor$. On our basic RISC:


   li     M,0xC28F5C29     Load mult. inverse of 25.
   mul    q,M,n            q = right half of M*n.
   li     c,0x0A3D70A3     c = floor((2**32-1)/25).
   cmpleu t,q,c            Compare q and c, and branch
   bt     t,is_mult        if n is a multiple of 25.
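
In C, the same test can be sketched as a multiply and an unsigned comparison. The function name is hypothetical, and the sketch assumes a 32-bit unsigned type so that the product is reduced modulo 2^32 automatically.

```c
#include <stdint.h>

/* Returns 1 if n is a multiple of 25, else 0. */
int is_multiple_of_25(uint32_t n) {
    uint32_t q = n * 0xC28F5C29u;    /* n times the inverse of 25, mod 2^32 */
    return q <= 0x0A3D70A3u;         /* floor((2^32 - 1)/25) */
}
```

The largest multiple of 25 that fits in 32 bits, 4294967275, maps to q = 0x0A3D70A3 and so sits exactly at the comparison bound.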


To extend this to even divisors, let $d = d_o \cdot 2^k$ where $d_o$ is odd and $k \ge 1$. Then, because an integer is divisible
by d if and only if it is divisible by $d_o$ and by $2^k$, and because n and $n\bar{d_o}$ have the same number of trailing zeros
($\bar{d_o}$ is odd), the test that n is a multiple of d is

   Set $q = \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $q \le \lfloor (2^W - 1)/d_o \rfloor$ and q ends in k or more 0-bits,

where the "mod" function is understood to reduce $n\bar{d_o}$ to the interval $[0, 2^W - 1]$.


Direct implementation of this requires two tests and conditional branches, but it can be reduced to one compare-
branch quite efficiently if the machine has the rotate-shift instruction. This follows from the following theorem,
in which $a \overset{\mathrm{rot}}{\gg} k$ denotes the computer word a rotated right k positions ($0 \le k \le 32$).


THEOREM ZRU. $x \le a$ and x ends in k 0-bits if and only if $x \overset{\mathrm{rot}}{\gg} k \le \lfloor a/2^k \rfloor$.


Proof. (Assume a 32-bit machine.) Suppose $x \le a$ and x ends in k 0-bits. Then, because $x \le a$,
$\lfloor x/2^k \rfloor \le \lfloor a/2^k \rfloor$. But $\lfloor x/2^k \rfloor = x \overset{\mathrm{rot}}{\gg} k$. Therefore, $x \overset{\mathrm{rot}}{\gg} k \le \lfloor a/2^k \rfloor$. If x does not end in k 0-bits,
then $x \overset{\mathrm{rot}}{\gg} k$ does not begin with k 0-bits, whereas $\lfloor a/2^k \rfloor$ does, so $x \overset{\mathrm{rot}}{\gg} k > \lfloor a/2^k \rfloor$. Lastly, if
x > a and x ends in k 0-bits, then the integer formed from the first 32 - k bits of x must exceed that formed
from the first 32 - k bits of a, so that $\lfloor x/2^k \rfloor > \lfloor a/2^k \rfloor$.


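Using this theorem, the test that n is a multiple of d, where n and $d \ge 1$ are unsigned integers and $d = d_o \cdot 2^k$
with $d_o$ odd, is

   $q \leftarrow \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $q \overset{\mathrm{rot}}{\gg} k \le \lfloor (2^W - 1)/d \rfloor$.

Here we used $\lfloor \lfloor (2^W - 1)/d_o \rfloor / 2^k \rfloor = \lfloor (2^W - 1)/(d_o \cdot 2^k) \rfloor = \lfloor (2^W - 1)/d \rfloor$.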


As an example, the following code tests an unsigned integer n to see if it is a multiple of 100: 


   li     M,0xC28F5C29     Load mult. inverse of 25.
   mul    q,M,n            q = right half of M*n.
   shrri  q,q,2            Rotate right two positions.
   li     c,0x028F5C28     c = floor((2**32-1)/100).
   cmpleu t,q,c            Compare q and c, and branch
   bt     t,is_mult        if n is a multiple of 100.
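
A C rendering of this sequence is sketched below. C has no rotate operator, so a small helper, here called rotr32 (a hypothetical name), builds the rotation from two shifts; many compilers recognize this idiom and emit a single rotate instruction.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by k positions, 0 <= k <= 31. */
static uint32_t rotr32(uint32_t x, unsigned k) {
    return (x >> k) | (x << ((32 - k) & 31));
}

/* Returns 1 if n is a multiple of 100 = 25 * 2^2, else 0. */
int is_multiple_of_100(uint32_t n) {
    uint32_t q = n * 0xC28F5C29u;          /* n times the inverse of 25 */
    return rotr32(q, 2) <= 0x028F5C28u;    /* floor((2^32 - 1)/100) */
}
```

The rotation moves any nonzero low-order bits of q into the high-order positions, so a single unsigned comparison rejects both values that are too large and values that are not multiples of 4.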


Signed, Divisor ≥ 2

For signed division, it was shown in the preceding section that if n is a multiple of d, and d is odd, then

$$\frac{n}{d} \equiv n\bar{d} \pmod{2^W}.$$

Thus, for $n = \lceil -2^{W-1}/d \rceil \cdot d, \ldots, -d, 0, d, \ldots, \lfloor (2^{W-1} - 1)/d \rfloor \cdot d$, we have
$n\bar{d} \equiv \lceil -2^{W-1}/d \rceil, \ldots, -1, 0, 1, \ldots, \lfloor (2^{W-1} - 1)/d \rfloor \pmod{2^W}$. Furthermore, because $\bar{d}$ is relatively
prime to $2^W$, as n ranges over all $2^W$ distinct values modulo $2^W$, $n\bar{d}$ takes on all $2^W$ distinct values modulo $2^W$.
Therefore, n is a multiple of d if and only if

$$\lceil -2^{W-1}/d \rceil \le \mathrm{mod}(n\bar{d}, 2^W) \le \lfloor (2^{W-1} - 1)/d \rfloor,$$

where the "mod" function is understood to reduce $n\bar{d}$ to the interval $[-2^{W-1}, 2^{W-1} - 1]$.


This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not
equal to 1, it does not divide $2^{W-1}$. Therefore,

$$\lceil -2^{W-1}/d \rceil = \lceil (-2^{W-1} + 1)/d \rceil = -\lfloor (2^{W-1} - 1)/d \rfloor.$$

Thus, for signed numbers, the test that n is a multiple of d, where $d = d_o \cdot 2^k$ and $d_o$ is odd, is

   Set $q = \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $-\lfloor (2^{W-1} - 1)/d_o \rfloor \le q \le \lfloor (2^{W-1} - 1)/d_o \rfloor$ and q ends in k or more 0-bits.


On the surface, this would seem to require three tests and branches. However, as in the unsigned case, it can be 
reduced to one compare-branch by use of the following theorem. 


Theorem ZRS. If $a \ge 0$, the following assertions are equivalent:

   (1) $-a \le x \le a$ and x ends in k or more 0-bits,
   (2) $\mathrm{abs}(x) \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} \lfloor a/2^k \rfloor$, and
   (3) $(x + a') \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} \lfloor (2a')/2^k \rfloor$,

where a' is a with its rightmost k bits set to 0 (that is, $a' = a \mathbin{\&} -2^k$).


Proof. (Assume a 32-bit machine.) To see that (1) is equivalent to (2), clearly the assertion $-a \le x \le a$ is
equivalent to $\mathrm{abs}(x) \le a$. Then, Theorem ZRU applies, because both sides of this inequality are nonnegative.


To see that (1) is equivalent to (3), note that assertion (1) is equivalent to itself with a replaced with a'. Then,
by the theorem on bounds checking on page 52, this in turn is equivalent to

$$x + a' \overset{u}{\le} 2a'.$$

Because x + a' ends in k 0-bits if and only if x does, Theorem ZRU applies, giving the result.


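Using part (3) of this theorem, the test that n is a multiple of d, where n and $d \ge 2$ are signed integers and
$d = d_o \cdot 2^k$ with $d_o$ odd, is

   $q \leftarrow \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $a' \leftarrow \lfloor (2^{W-1} - 1)/d_o \rfloor \mathbin{\&} -2^k$;
   $(q + a') \overset{\mathrm{rot}}{\gg} k \overset{u}{\le} \lfloor (2a')/2^k \rfloor$.

(a' may be computed at compile time, because d is a constant.)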


As an example, the following code tests a signed integer n to see if it is a multiple of 100. Notice that the
constant $\lfloor 2a'/2^k \rfloor$ can always be derived from the constant a' by a shift of k - 1 bits, saving an instruction or
a load from memory to develop the comparand.


   li     M,0xC28F5C29     Load mult. inverse of 25.
   mul    q,M,n            q = right half of M*n.
   li     c,0x051EB850     c = floor((2**31 - 1)/25) & -4.
   add    q,q,c            Add c.
   shrri  q,q,2            Rotate right two positions.
   shri   c,c,1            Compute const. for comparison.
   cmpleu t,q,c            Compare q and c, and
   bt     t,is_mult        branch if n is a mult. of 100.
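
The signed test can be sketched in C in the same way. The helper rotr32 and the function name are hypothetical, and the arithmetic is carried out on uint32_t so that the additions and the multiplication wrap modulo 2^32.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by k positions, 0 <= k <= 31. */
static uint32_t rotr32(uint32_t x, unsigned k) {
    return (x >> k) | (x << ((32 - k) & 31));
}

/* Returns 1 if the signed integer n is a multiple of 100, else 0. */
int is_multiple_of_100_signed(int32_t n) {
    uint32_t q = (uint32_t)n * 0xC28F5C29u;  /* n times the inverse of 25 */
    uint32_t c = 0x051EB850u;                /* a' = floor((2^31 - 1)/25) & -4 */
    q = q + c;                               /* shift the range to start at 0 */
    return rotr32(q, 2) <= (c >> 1);         /* compare with floor(2a'/2^2) */
}
```

Both 2147483600 and -2147483600, the extreme multiples of 100 in the signed 32-bit range, land exactly on the two ends of the accepted interval.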


I think that I shall never envision 

An op unlovely as division. 

An op whose answer must be guessed 

And then, through multiply, assessed; 

An op for which we dearly pay, 

In cycles wasted every day. 

Division code is often hairy; 

Long division's downright scary. 

The proofs can overtax your brain, 

The ceiling and floor may drive you insane. 
Good code to divide takes a Knuthian hero, 


But even God can't divide by zero! 


Chapter 11. Some Elementary Functions 


Integer Square Root 
Integer Cube Root 
Integer Exponentiation 


Integer Logarithm 


11-1 Integer Square Root 


By the "integer square root" function, we mean the function $\lfloor \sqrt{x} \rfloor$. To extend its range of application and to
avoid deciding what to do with a negative argument, we assume x is unsigned. Thus, $0 \le x \le 2^{32} - 1$.


Newton's Method


For floating-point numbers, the square root is almost universally computed by Newton's method. This method
begins by somehow obtaining a starting estimate $g_0$ of $\sqrt{a}$. Then, a series of more accurate estimates is
obtained from

$$g_{n+1} = \left( g_n + \frac{a}{g_n} \right) / 2.$$

The iteration converges quadratically; that is, if at some point $g_n$ is accurate to n bits, then $g_{n+1}$ is accurate to 2n
bits. The program must have some means of knowing when it has iterated enough, so it can terminate.


It is a pleasant surprise that Newton's method works fine in the domain of integers. To see this, we need the 
following theorem: 


THEOREM. Let $g_{n+1} = \lfloor (g_n + \lfloor a/g_n \rfloor)/2 \rfloor$, with $g_n$, a integers greater than 0. Then

   (a) if $g_n > \lfloor \sqrt{a} \rfloor$ then $\lfloor \sqrt{a} \rfloor \le g_{n+1} < g_n$, and
   (b) if $g_n = \lfloor \sqrt{a} \rfloor$ then $\lfloor \sqrt{a} \rfloor \le g_{n+1} \le \lfloor \sqrt{a} \rfloor + 1$.


That is, if we have an integral guess $g_n$ to $\lfloor \sqrt{a} \rfloor$ that is too high, then the next guess $g_{n+1}$ will be strictly less
than the preceding one, but not less than $\lfloor \sqrt{a} \rfloor$. Therefore, if we start with a guess that's too high, the sequence
converges monotonically. If the guess $g_n = \lfloor \sqrt{a} \rfloor$, then the next guess is either equal to $g_n$ or is 1 larger. This
provides an easy way to determine when the sequence has converged: If we start with $g_0 \ge \lfloor \sqrt{a} \rfloor$,
convergence has occurred when $g_{n+1} \ge g_n$, and then the result is precisely $g_n$.


The case a = 0, however, must be treated specially, because this procedure would lead to dividing 0 by 0. 


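Proof. (a) Because $g_n$ is an integer,

$$g_{n+1} = \left\lfloor \frac{g_n + \lfloor a/g_n \rfloor}{2} \right\rfloor = \left\lfloor \frac{\lfloor g_n + a/g_n \rfloor}{2} \right\rfloor = \left\lfloor \frac{g_n + a/g_n}{2} \right\rfloor = \left\lfloor \frac{g_n^2 + a}{2 g_n} \right\rfloor.$$

Because $g_n > \lfloor \sqrt{a} \rfloor$ and $g_n$ is an integer, $g_n > \sqrt{a}$. Define $\varepsilon$ by $g_n = (1 + \varepsilon)\sqrt{a}$. Then $\varepsilon > 0$ and

$$\left\lfloor \frac{g_n^2 + a}{2 g_n} \right\rfloor = g_{n+1} \le \frac{g_n^2 + a}{2 g_n} < \frac{g_n^2 + g_n^2}{2 g_n},$$
$$\left\lfloor \frac{(1 + \varepsilon)^2 a + a}{2(1 + \varepsilon)\sqrt{a}} \right\rfloor = g_{n+1} < g_n,$$
$$\left\lfloor \frac{2 + 2\varepsilon + \varepsilon^2}{2(1 + \varepsilon)} \sqrt{a} \right\rfloor = g_{n+1} < g_n,$$
$$\left\lfloor \frac{2 + 2\varepsilon}{2(1 + \varepsilon)} \sqrt{a} \right\rfloor \le g_{n+1} < g_n,$$
$$\lfloor \sqrt{a} \rfloor \le g_{n+1} < g_n.$$

(b) Because $g_n = \lfloor \sqrt{a} \rfloor$, $\sqrt{a} - 1 < g_n \le \sqrt{a}$, so that $g_n^2 \le a < (g_n + 1)^2$. Hence we have

$$\left\lfloor \frac{g_n^2 + g_n^2}{2 g_n} \right\rfloor \le g_{n+1} \le \left\lfloor \frac{g_n^2 + (g_n + 1)^2}{2 g_n} \right\rfloor,$$
$$\lfloor g_n \rfloor \le g_{n+1} \le \left\lfloor g_n + 1 + \frac{1}{2 g_n} \right\rfloor,$$
$$\lfloor \sqrt{a} \rfloor \le g_{n+1} \le \lfloor g_n + 1 \rfloor \quad \text{(because } g_n \text{ is an integer and } \tfrac{1}{2 g_n} < 1\text{)},$$
$$\lfloor \sqrt{a} \rfloor \le g_{n+1} \le \lfloor g_n \rfloor + 1 = \lfloor \sqrt{a} \rfloor + 1.$$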


The difficult part of using Newton's method to calculate $\lfloor \sqrt{x} \rfloor$ is getting the first guess. The procedure of
Figure 11-1 sets the first guess $g_0$ equal to the least power of 2 that is greater than or equal to $\sqrt{x}$. For example,
for x = 4, $g_0 = 2$, and for x = 5, $g_0 = 4$.


Figure 11-1 Integer square root, Newton's method.


int isqrt(unsigned x) {
   unsigned x1;
   int s, g0, g1;

   if (x <= 1) return x;
   s = 1;
   x1 = x - 1;
   if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
   if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
   if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
   if (x1 > 3)     {s = s + 1;}

   g0 = 1 << s;                // g0 = 2**s.
   g1 = (g0 + (x >> s)) >> 1;  // g1 = (g0 + x/g0)/2.

   while (g1 < g0) {           // Do while approximations
      g0 = g1;                 // strictly decrease.
      g1 = (g0 + (x/g0)) >> 1;
   }
   return g0;
}


Because the first guess $g_0$ is a power of 2, it is not necessary to do a real division to get $g_1$; instead, a shift right
suffices.


Because the first guess is accurate to about 1 bit, and Newton's method converges quadratically (the number of 
bits of accuracy doubles with each iteration), one would expect the procedure to converge within about five 
iterations (on a 32-bit machine), which requires four divisions (because the first iteration substitutes a shift 
right). An exhaustive experiment reveals that the maximum number of divisions is five, or four for arguments 
up to 16,785,407. 


If number of leading zeros is available, then getting the first guess is very simple: Replace the first seven 
executable lines in the procedure above with 


if (x <= 1) return x; 
s = 16 - nlz(x - 1)/2; 
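
Put together, the variant with number of leading zeros might read as follows. This is a sketch only: the portable helper nlz below is an assumption standing in for a hardware instruction or compiler intrinsic, and the function name isqrt_nlz is hypothetical.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word (simple portable version). */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Integer square root by Newton's method, first guess from nlz. */
uint32_t isqrt_nlz(uint32_t x) {
    uint32_t g0, g1;
    int s;

    if (x <= 1) return x;
    s = 16 - nlz(x - 1)/2;
    g0 = (uint32_t)1 << s;                 /* g0 = 2**s */
    g1 = (g0 + (x >> s)) >> 1;             /* g1 = (g0 + x/g0)/2 */
    while (g1 < g0) {                      /* until estimates stop decreasing */
        g0 = g1;
        g1 = (g0 + x/g0) >> 1;
    }
    return g0;
}
```

For x = 1000000 the first guess is 1024, the shift step gives 1000, and one division confirms that the sequence has stopped decreasing.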


Another alternative, if number of leading zeros is not available, is to compute s by means of a binary search
tree. This method permits getting a slightly better value of $g_0$: the least power of 2 that is greater than or equal
to $\lfloor \sqrt{x} \rfloor$. For some values of x, this gives a smaller value of $g_0$, but a value large enough so that the
convergence criterion of the theorem still holds. The difference in these schemes is illustrated below.


   Range of x for            Range of x for                        First
   Figure 11-1               Figure 11-2                           Guess g0
   0                         0                                     0
   1                         1 to 3                                1
   2 to 4                    4 to 8                                2
   5 to 16                   9 to 24                               4
   17 to 64                  25 to 80                              8
   65 to 256                 81 to 288                             16
   ...                       ...                                   ...
   2^28 + 1 to 2^30          (2^14 + 1)^2 to (2^15 + 1)^2 - 1      2^15
   2^30 + 1 to 2^32 - 1      (2^15 + 1)^2 to 2^32 - 1              2^16


This procedure is shown in Figure 11-2. It is convenient there to treat small values of x ($0 \le x \le 24$) specially,
so that no divisions are done for them.


Figure 11-2 Integer square root, binary search for first guess.


int isqrt(unsigned x) {
   int s, g0, g1;

   if (x <= 4224)
      if (x <= 24)
         if (x <= 3) return (x + 3) >> 2;
         else if (x <= 8) return 2;
         else return (x >> 4) + 3;
      else if (x <= 288)
         if (x <= 80) s = 3; else s = 4;
      else if (x <= 1088) s = 5; else s = 6;
   else if (x <= 1025*1025 - 1)
      if (x <= 257*257 - 1)
         if (x <= 129*129 - 1) s = 7; else s = 8;
      else if (x <= 513*513 - 1) s = 9; else s = 10;
   else if (x <= 4097*4097 - 1)
      if (x <= 2049*2049 - 1) s = 11; else s = 12;
   else if (x <= 16385*16385 - 1)
      if (x <= 8193*8193 - 1) s = 13; else s = 14;
   else if (x <= 32769*32769 - 1) s = 15; else s = 16;
   g0 = 1 << s;                // g0 = 2**s.

   // Continue as in Figure 11-1.


The worst-case execution time of the algorithm of Figure 11-1, on the basic RISC, is about 26 + (D + 6)n 
cycles, where D is the divide time in cycles and n is the number of times the while-loop is executed. The worst- 
case execution time of Figure 11-2 is about 27 + (D + 6)n cycles, assuming (in both cases) that the branch 
instructions take one cycle. The table below gives the average number of times the loop is executed by the two 
algorithms, for x uniformly distributed in the indicated range. 


   x                  Figure 11-1     Figure 11-2

   0 to 9             0.80
   0 to 99            1.46
   0 to 999           1.58
   0 to 9999          2.13
   0 to 2^32 - 1      2.97


If we assume a divide time of 20 cycles and x ranging uniformly from 0 to 9999, then both algorithms execute 
in about 81 cycles. 


Binary Search 


Because the algorithms based on Newton's method start out with a sort of binary search to obtain the first guess,
why not do the whole computation with a binary search? This method would start out with two bounds, perhaps
initialized to 0 and $2^{16}$. It would make a guess at the midpoint of the bounds. If the square of the midpoint is
greater than the argument x, then the upper bound is changed to be equal to the midpoint. If the square of the
midpoint is less than the argument x, then the lower bound is changed to be equal to the midpoint. The process
ends when the upper and lower bounds differ by 1, and the result is the lower bound.


This avoids division, but requires quite a few multiplications: 16 if 0 and $2^{16}$ are used as the initial bounds.
(The method gets one more bit of precision with each iteration.) Figure 11-3 illustrates a variation of this
procedure, which uses initial values for the bounds that are slight improvements over 0 and $2^{16}$. The procedure
shown in Figure 11-3 also saves a cycle in the loop, for most RISC machines, by altering a and b in such a way
that the comparison is $b \ge a$ rather than $b - a \ge 1$.


Figure 11-3 Integer square root, simple binary search.


int isqrt(unsigned x) {
   unsigned a, b, m;           // Limits and midpoint.

   a = 1;
   b = (x >> 5) + 8;           // See text.
   if (b > 65535) b = 65535;
   do {
      m = (a + b) >> 1;
      if (m*m > x) b = m - 1;
      else         a = m + 1;
   } while (b >= a);
   return a - 1;
}


The predicates that must be maintained at the beginning of each iteration are $a \le \lfloor \sqrt{x} \rfloor + 1$ and
$b \ge \lfloor \sqrt{x} \rfloor$. The initial value of b should be something that's easy to compute and close to $\lfloor \sqrt{x} \rfloor$. Reasonable
initial values are x, x ÷ 4 + 1, x ÷ 8 + 2, x ÷ 16 + 4, x ÷ 32 + 8, x ÷ 64 + 16, and so on. Expressions near the
beginning of this list are better initial bounds for small x, and those near the end are better for larger x. (The
value x ÷ 2 + 1 is acceptable, but probably not useful, because x ÷ 4 + 1 is everywhere a better or equal bound.)


Seven variations on the procedure shown in Figure 11-3 can be more or less mechanically generated by
substituting a + 1 for a, or b - 1 for b, or by changing m = (a + b) ÷ 2 to m = (a + b + 1) ÷ 2, or some
combination of these substitutions.


The execution time of the procedure shown in Figure 11-3 is about 6 + (M + 7.5)n, where M is the 


multiplication time in cycles and n is the number of times the loop is executed. The table below gives the 
average number of times the loop is executed, for x uniformly distributed in the indicated range. 


   x                  Average Number of Loop Iterations

   0 to 9             3.00
   0 to 99            3.15
   0 to 999
   0 to 9999          7.04
   0 to 2^32 - 1


If we assume a multiplication time of 5 cycles and x ranging uniformly from 0 to 9999, the algorithm runs in 
about 94 cycles. The maximum execution time (n = 16) is about 206 cycles. 


If number of leading zeros is available, the initial bounds can be set from 


   b = (1 << (33 - nlz(x))/2) - 1;
   a = (b + 3)/2;


That is, $b = 2^{\lfloor (33 - \mathrm{nlz}(x))/2 \rfloor} - 1$. These are very good bounds for small values of x (one loop iteration for
$0 \le x \le 15$), but only a moderate improvement, for large x, over the bounds calculated in Figure 11-3. For x in
the range 0 to 9999, the average number of iterations is about 5.45, which gives an execution time of about 74
cycles, using the same assumptions as above.
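
A sketch of the binary search with bounds set from number of leading zeros is given below. As before, the helper nlz is a portable stand-in for a machine instruction, and the function name is hypothetical.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word (simple portable version). */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Integer square root by binary search, bounds from nlz. */
uint32_t isqrt_bsearch_nlz(uint32_t x) {
    uint32_t a, b, m;                      /* limits and midpoint */

    b = ((uint32_t)1 << ((33 - nlz(x))/2)) - 1;
    a = (b + 3)/2;
    do {
        m = (a + b) >> 1;
        if (m*m > x) b = m - 1;            /* m <= 65535, so m*m cannot overflow */
        else         a = m + 1;
    } while (b >= a);
    return a - 1;
}
```

For x = 15 the bounds are a = b = 3, so the loop body runs once and the function returns 3.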
A Hardware Algorithm 


There is a shift-and-subtract algorithm for computing the square root that is quite similar to the hardware 
division algorithm described in Figure 9-2 on page 149. Embodied in hardware on a 32-bit machine, this 
algorithm employs a 64-bit register that is initialized to 32 0-bits followed by the argument x. On each iteration, 
the 64-bit register is shifted left two positions, and the current result y (initially 0) is shifted left one position. 
Then 2y + 1 is subtracted from the left half of the 64-bit register. If the result of the subtraction is nonnegative, 
it replaces the left half of the 64-bit register, and 1 is added to y (this does not require an adder, because y ends 
in 0 at this point). If the result of the subtraction is negative, then the 64-bit register and y are left unaltered. The 
iteration is done 16 times. 


This algorithm was described in 1945 [JVN]. 


Perhaps surprisingly, this process runs in about half the time of that of the 64 ÷ 32 ⇒ 32 hardware division
algorithm cited, because it does half as many iterations and each iteration is about equally complex in the two
algorithms.


To code this algorithm in software, it is probably best to avoid the use of a doubleword shift register, which 
requires about four instructions to shift. The algorithm in Figure 11-4 [GLS1] accomplishes this by shifting y 
and a mask bit m to the right. It executes in about 149 basic RISC instructions (average). The two expressions y 
| m could also be y + m. 


Figure 11-4 Integer square root, hardware algorithm. 


int isqrt(unsigned x) {
   unsigned m, y, b;

   m = 0x40000000;
   y = 0;
   while(m != 0) {              // Do 16 times.
      b = y | m;
      y = y >> 1;
      if (x >= b) {
         x = x - b;
         y = y | m;
      }
      m = m >> 2;
   }
   return y;
}


The operation of this algorithm is similar to the grade-school method. It is illustrated below, for finding
$\lfloor \sqrt{179} \rfloor$ on an 8-bit machine.


     1011 0011   x0     Initially, x = 179 (0xB3).
   - 0100 0000   b1
     ---------
     0111 0011   x1     0100 0000   y1
   - 0101 0000   b2     0010 0000   y2
     ---------
     0010 0011   x2     0011 0000   y2
   - 0011 0100   b3     0001 1000   y3
     ---------
     0010 0011   x3     0001 1000   y3   (Can't subtract.)
   - 0001 1001   b4     0000 1100   y4
     ---------
     0000 1010   x4     0000 1101   y4

The result is 13 with a remainder of 10 left in register x.


It is possible to eliminate the if x >= b test by the usual trickery involving shift right signed 31. It can be
proved that the high-order bit of b is always zero (in fact, $b \le 5 \cdot 2^{28}$), which simplifies the x >= b predicate
(see page 22). The result is that the if statement group can be replaced with


t = (int)(x | ~(x - b)) >> 31;   // -1 if x >= b, else 0.
x = x - (b & t);
y = y | (m & t);


This replaces an average of three cycles with seven, assuming the machine has or not, but it might be 
worthwhile if a conditional branch in this context takes more than five cycles. 
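

As a concrete illustration, the branch-free replacement can be placed in the loop of Figure 11-4 roughly as sketched below. This is only a sketch under stated assumptions: the name isqrt_branchfree and the fixed-width types from <stdint.h> are choices made here, not part of the original figure. The mask t is formed with an unsigned shift followed by a negation, instead of the shift right signed of the text, so that the C code does not depend on how a compiler shifts negative values.

```c
#include <stdint.h>

/* Integer square root, hardware algorithm, with the if (x >= b) test
   replaced by mask arithmetic. Hypothetical name; 32-bit operands assumed. */
uint32_t isqrt_branchfree(uint32_t x) {
    uint32_t m = 0x40000000u, y = 0, b, t;

    while (m != 0) {                         /* Do 16 times. */
        b = y | m;
        y = y >> 1;
        /* The high-order bit of b is always 0, so the sign bit of
           x | ~(x - b) is 1 exactly when x >= b. */
        t = 0u - ((x | ~(x - b)) >> 31);     /* All 1's if x >= b, else 0. */
        x = x - (b & t);
        y = y | (m & t);
        m = m >> 2;
    }
    return y;
}
```

On a machine with a signed shift, the two operations that form t collapse into the single shift right signed shown in the text.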


Somehow it seems that it should be easier than some hundred cycles to compute an integer square root in 
software. Toward this end, we offer the expressions below to compute it for very small values of the argument. 
These can be useful to speed up some of the algorithms given above, if the argument is expected to be small. 


The expression     is correct in the range     and uses this many instructions (full RISC)

x                  0 to 1                      0
…                  …                           …
…                  …                           …
(x + 15) …         … to 15                     2
…                  …                           3
…                  …                           5


Ah, the elusive square root, 


It should be a cinch to compute. 
But the best we can do 
Is use powers of two 


And iterate the method of Newt! 


11-2 Integer Cube Root 


For cube roots, Newton's method does not work out very well. The iterative formula is a bit complex:


$$x_{n+1} = \frac{1}{3}\left(2x_n + \frac{a}{x_n^2}\right),$$


and there is of course the problem of getting a good starting value $x_0$.


However, there is a hardware algorithm, similar to the hardware algorithm for square root, that is not too bad 
for software. It is shown in Figure 11-5. 


Figure 11-5 Integer cube root, hardware algorithm.


int icbrt(unsigned x) {
   int s;
   unsigned y, b;

   s = 30;
   y = 0;
   while(s >= 0) {              // Do 11 times.
      y = 2*y;
      b = (3*y*(y + 1) + 1) << s;
      s = s - 3;
      if (x >= b) {
         x = x - b;
         y = y + 1;
      }
   }
   return y;
}


The three add's of 1 can be replaced by or's of 1, because the value being incremented is even. Even with this 
change, the algorithm is of questionable value for implementation in hardware, mainly because of the 
multiplication y * (y + 1). 


This multiplication is easily avoided by applying the compiler optimization of strength reduction to the
y-squared term. Introduce another unsigned variable y2 that will have the value of y-squared, by updating y2
appropriately wherever y receives a new value. Just before y = 0 insert y2 = 0. Just before y = 2*y insert
y2 = 4*y2. Change the assignment to b to b = (3*y2 + 3*y + 1) << s (and factor out the 3). Just before
y = y + 1, insert y2 = y2 + 2*y + 1. The resulting program has no multiplications except by small
constants, which can be changed to shift's and add's. This program has three add's of 1, which can all be
changed to or's of 1. It is faster unless your machine's multiply instruction takes only two or fewer cycles.
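

Carrying out the insertions just described on Figure 11-5 might give something like the following sketch. The name icbrt_sr and the use of uint32_t are assumptions made here for illustration; the add's of 1 are left as additions rather than changed to or's.

```c
#include <stdint.h>

/* Integer cube root with the y*(y + 1) product strength-reduced.
   y2 tracks y*y. Hypothetical name; 32-bit operands assumed. */
uint32_t icbrt_sr(uint32_t x) {
    int s = 30;
    uint32_t y = 0, y2 = 0, b;

    while (s >= 0) {                    /* Do 11 times. */
        y2 = 4*y2;                      /* Keep y2 == y*y across y = 2*y. */
        y = 2*y;
        b = (3*(y2 + y) + 1) << s;      /* 3*y*y + 3*y + 1, with the 3 factored out. */
        s = s - 3;
        if (x >= b) {
            x = x - b;
            y2 = y2 + 2*y + 1;          /* (y + 1)^2, using the old y. */
            y = y + 1;
        }
    }
    return y;
}
```

The multiplications that remain are by 2, 3, and 4, which a compiler will normally turn into shifts and adds.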


Caution: [GLS1] points out that the code of Figure 11-5, and its strength-reduced derivative, do not work if
adapted in the obvious way to a 64-bit machine. The assignment to b can then overflow. This problem can be
avoided by dropping the shift left of s from the assignment to b, inserting after the assignment to b the
assignment bs = b << s, and changing the two lines if (x >= b) {x = x - b ... to
if (x >= bs && b == (bs >> s)) {x = x - bs ....


11-3 Integer Exponentiation 

Computing $x^n$ by Binary Decomposition of n

A well-known technique for computing $x^n$, when n is a nonnegative integer, involves the binary representation
of n. The technique applies to the evaluation of an expression of the form $x \cdot x \cdot x \cdots x$ where $\cdot$ is any
associative operation, such as addition, multiplication including matrix multiplication, and string concatenation
(as suggested by the notation $(\text{'ab'})^3$ = 'ababab'). As an example, suppose we wish to compute $y = x^{13}$. Because
13 expressed in binary is 1101 (that is, 13 = 8 + 4 + 1),


$$x^{13} = x^{8+4+1} = x^8 \cdot x^4 \cdot x^1.$$


Thus, $x^{13}$ may be computed as follows:


$$\begin{aligned}
t_1 &\leftarrow x^2 \\
t_2 &\leftarrow t_1^2 \\
t_3 &\leftarrow t_2^2 \\
y &\leftarrow t_3 \cdot t_2 \cdot x
\end{aligned}$$


This requires five multiplications, considerably fewer than the 12 that would be required by repeated 
multiplication by x. 


If the exponent is a variable, known to be a nonnegative integer, the technique can be employed in a subroutine, 
as shown in Figure 11-6. 


Figure 11-6 Computing $x^n$ by binary decomposition of n.


int iexp(int x, unsigned n) {
   int p, y;

   y = 1;                       // Initialize result
   p = x;                       // and p.
   while(1) {
      if (n & 1) y = p*y;       // If n is odd, mult by p.
      n = n >> 1;               // Position next bit of n.
      if (n == 0) return y;     // If no more bits in n.
      p = p*p;                  // Power for next bit of n.
   }
}


The number of multiplications done by this method is, for exponent $n \ge 1$,


$$\lfloor \log_2 n \rfloor + \mathrm{nbits}(n) - 1.$$


This is not always the minimal number of multiplications. For example, for n = 27 the binary decomposition
method computes


$$x^{16} \cdot x^8 \cdot x^2 \cdot x^1,$$


which requires seven multiplications. However, the scheme illustrated by


$$((x^3)^3)^3$$


requires only six. The smallest number for which the binary decomposition method is not optimal is n = 15
(hint: $x^{15} = (x^3)^5$).


Perhaps surprisingly, there is no known simple method that, for all n, finds an optimal sequence of
multiplications to compute $x^n$. The only known methods involve an extensive search. The problem is discussed
at some length in [Knu2, sec. 4.6.3].


The binary decomposition method has a variant that scans the binary representation of the exponent in
left-to-right order [Rib, 32], which is analogous to the left-to-right method of converting binary to decimal. Initialize
the result y to 1, and scan the exponent from left to right. When a 0 is encountered, square y. When a 1 is
encountered, square y and multiply it by x. This computes $x^{13} = x^{1101_2}$ as


$$(((1^2 \cdot x)^2 \cdot x)^2)^2 \cdot x.$$


It always requires the same number of (nontrivial) multiplications as the right-to-left method of Figure 11-6. 
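

The left-to-right scan can be sketched in C as follows. The function name iexp_ltr is an assumption made here, and unsigned 32-bit arithmetic is used so that overflow simply wraps modulo $2^{32}$. For simplicity the sketch scans all 32 bit positions, so it performs trivial multiplications (1 times 1) for the leading 0-bits of n; a tuned version would start at the most significant 1-bit.

```c
#include <stdint.h>

/* Left-to-right binary exponentiation: square for every bit of n,
   and also multiply by x when the bit is 1. Hypothetical name. */
uint32_t iexp_ltr(uint32_t x, uint32_t n) {
    uint32_t y = 1;
    int i;

    for (i = 31; i >= 0; i--) {
        y = y*y;                        /* Square for each bit scanned. */
        if ((n >> i) & 1u) y = y*x;     /* Bit is 1: multiply by x. */
    }
    return y;
}
```

For n = 13 (binary 1101) the nontrivial steps produce x, then x^3, then x^6, and finally x^13, matching the expression above.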


$2^n$ in Fortran


The IBM XL Fortran compiler takes the definition of this function to be


$$\mathrm{pow2}(n) = \begin{cases} 2^n, & 0 \le n \le 30, \\ -2^{31}, & n = 31, \\ 0, & n < 0 \text{ or } n \ge 32. \end{cases}$$


It is assumed that n and the result are interpreted as signed integers. The ANSI/ISO Fortran standard requires
that the result be 0 if n < 0. The definition above for $n \ge 31$ seems reasonable in that it is the correct result
modulo $2^{32}$, and it agrees with what repeated multiplication would give.


The standard way to compute $2^n$ is to put the integer 1 in a register and shift it left n places. This does not
satisfy the Fortran definition, because shift amounts are usually treated modulo 64 or modulo 32 (on a 32-bit
machine), which gives incorrect results for large or negative shift amounts.


If your machine has number of leading zeros, pow2(n) may be computed in four instructions as follows [Shep]:


x ← nlz(n >> 5);     // x = 32 if 0 ≤ n ≤ 31, x < 32 otherwise.
x ← x >> 5;          // x = 1 if 0 ≤ n ≤ 31, 0 otherwise.
pow2 ← x << n;


The shift right operations are "logical" (not sign-propagating), even though n is a signed quantity. 


If the machine does not have the "nlz" instruction, its use above can be replaced with one of the x = 0 tests
given in "Comparison Predicates" on page 21, changing the expression x >> 5 to x >> 31. A possibly better
method is to realize that the predicate $0 \le x \le 31$ is equivalent to the unsigned comparison $x <_u 32$, and then
simplify the expression for $x <_u y$ given in the cited section; it becomes ¬x & (x - 32). This gives a solution
in five instructions (four if the machine has and not):


x ← ¬n & (n - 32);   // x < 0 iff 0 ≤ n ≤ 31.
x ← x >> 31;         // x = 1 if 0 ≤ n ≤ 31, 0 otherwise.
pow2 ← x << n;
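

In C, the and not variant might be sketched as below. The name pow2_sketch is hypothetical. Two assumptions are worth stating: the result is returned as a 32-bit pattern (so 0x80000000 stands for $-2^{31}$ when read as a signed integer), and the shift amount is masked with 31 to imitate a hardware shift taken modulo 32, because shifting by an out-of-range amount is undefined in C.

```c
#include <stdint.h>

/* pow2(n) per the Fortran definition, without nlz. Hypothetical name. */
uint32_t pow2_sketch(int32_t n) {
    uint32_t u = (uint32_t)n;
    uint32_t x = ~u & (u - 32u);    /* Sign bit is 1 iff 0 <= n <= 31. */
    x = x >> 31;                    /* Logical shift: 1 if in range, else 0. */
    return x << (u & 31u);          /* Shift amount taken modulo 32. */
}
```

When n is out of range, x is 0, so the masked shift amount does not affect the result.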


11-4 Integer Logarithm 


By the "integer logarithm" function we mean the function $\lfloor \log_b x \rfloor$, where x is a positive integer and b is an 
integer greater than or equal to 2. Usually, b = 2 or 10, and we denote these functions by "ilog2" and "ilog10," 
respectively. We use "ilog" when the base is unspecified. 


It is convenient to extend the definition to x = 0 by defining ilog(0) = -1 [CJS]. There are several reasons for 
this definition: 


• The function ilog2(x) is then related very simply to the number of leading zeros function, nlz(x),
by the formula shown below, including the case x = 0. Thus, if one of these functions is implemented
in hardware or software, the other is easily obtained.


$$\mathrm{ilog2}(x) = 31 - \mathrm{nlz}(x)$$


• It is easy to compute $\lceil \log(x) \rceil$ using the formula below. For x = 1, this formula implies that
ilog(0) = -1.


$$\lceil \log(x) \rceil = \mathrm{ilog}(x - 1) + 1$$


• It makes the following identity hold for x = 1 (but it doesn't hold for x = 0):


$$\mathrm{ilog2}(x \div 2) = \mathrm{ilog2}(x) - 1$$


• It preserves the mathematical identity:


$$\lfloor \log_{10} x \rfloor = \lfloor (\log_{10} 2)\log_2 x \rfloor$$


• It makes the result of ilog(x) a small dense set of integers (-1 to 31 for ilog2(x) on a 32-bit
machine, with x unsigned), making it directly useful for indexing a table.


• It falls naturally out of several algorithms for computing ilog2(x) and ilog10(x).


Unfortunately, it isn't the right definition for "number of digits of x," which is ilog(x) + 1 for all x except x = 0. 
But it seems best to consider that anomalous. 


For x < 0, ilog(x) is left undefined. To extend its range of utility, we define the function as mapping unsigned 
numbers to signed numbers. Thus, a negative argument cannot occur. 


Integer Log Base 2 


Computing ilog2(x) is essentially the same as computing the number of leading zeros, which is discussed in 
"Counting Leading 0's" on page 77. All the algorithms in that section can be easily modified to compute ilog2 
(x) directly, rather than by computing nlz(x) and subtracting the result from 31. (For the algorithm of Figure 5-11
on page 80, change the line return pop(~x) to return pop(x) - 1.)


Integer Log Base 10 


This function has application in converting a number to decimal for inclusion into a line with leading zeros 
suppressed. The conversion process successively divides by 10, producing the least significant digit first. It 
would be useful to know ahead of time where the least significant digit should be placed, to avoid putting the 
converted number in a temporary area and then moving it. 


To compute ilog10(x), a table search is quite reasonable. This could be a binary search, but because the table is 
small and in many applications x is usually small, a simple linear search is probably best. This rather 
straightforward program is shown in Figure 11-7. 


Figure 11-7 Integer log base 10, simple table search. 


int ilog10(unsigned x) {
   int i;
   static unsigned table[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   for (i = -1; ; i++) {
      if (x <= table[i+1]) return i;
   }
}


On the basic RISC, this program can be implemented to execute in about $9 + 4\lfloor \log_{10} x \rfloor$ instructions. Thus, it
executes in five to 45 instructions, with perhaps 13 (for $10 \le x \le 99$) being typical.


The program in Figure 11-7 can easily be changed into an "in register" version (not using a table). The
executable part of such a program is shown in Figure 11-8. This might be useful if the machine has a fast way
to multiply by 10.


Figure 11-8 Integer log base 10, repeated multiplication by 10. 


   p = 1;
   for (i = -1; i <= 8; i++) {
      if (x < p) return i;
      p = 10*p;
   }
   return i;


This program can be implemented to execute in about $10 + 6\lfloor \log_{10} x \rfloor$ instructions on the basic RISC
(counting the multiply as one instruction). This amounts to 16 instructions for $10 \le x \le 99$.


A binary search can be used, giving an algorithm that is loop-free and does not use a table. Such an algorithm
might compare x to $10^4$, then to either $10^2$ or to $10^6$, and so on, until the exponent n is found such that
$10^n \le x < 10^{n+1}$. The paths execute in ten to 18 instructions, four or five of which are branches (counting the final
unconditional branch).


The program shown in Figure 11-9 is a modification of the binary search that has a maximum of four branches
on any path, and is written in a way that favors small x. It executes in six basic RISC instructions for
$10 \le x \le 99$, and in 11 to 16 instructions for $x \ge 100$.


Figure 11-9 Integer log base 10, modified binary search.


int ilog10(unsigned x) {
   if (x > 99)
      if (x < 1000000)
         if (x < 10000)
            return 3 + ((int)(x - 1000) >> 31);
         else
            return 5 + ((int)(x - 100000) >> 31);
      else
         if (x < 100000000)
            return 7 + ((int)(x - 10000000) >> 31);
         else
            return 9 + ((int)((x-1000000000)&~x) >> 31);
   else
      if (x > 9) return 1;
      else       return ((int)(x - 1) >> 31);
}


The shift instructions in this program are signed shifts (which is the reason for the (int) casts). If your
machine does not have this instruction, one of the alternatives below, which use unsigned shifts, may be
preferable. These are illustrated for the case of the first return statement. Unfortunately, the first two require
subtract from immediate for efficient implementation, which most machines don't have. The last involves
adding a large constant (two instructions), but this does not matter for the second and third return
statements, which require adding a large constant anyway. The large constant is $2^{31} - 1000$.


   return 3 - ((x - 1000) >> 31);
   return 2 + ((999 - x) >> 31);
   return 2 + ((x + 2147482648) >> 31);


An alternative for the fourth return statement is


   return 8 + (((x + 1147483648) | x) >> 31);


where the large constant is $2^{31} - 10^9$. This avoids both the and not and the signed shift.


Alternatives for the last if-else construction are 


return ((int)(x - 1) >> 31) | ((unsigned)(9 - x) >> 31); 
return (x > 9) + (x > 0) - 1; 


either of which saves a branch. 


If nlz(x) or ilog2(x) is available as an instruction, there are better and more interesting ways to compute ilog10 
(x). For example, the program in Figure 11-10 does it in two table lookups [CJS]. 


Figure 11-10 Integer log base 10 from log base 2, double table lookup. 


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {9, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[10] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000};

   y = table1[nlz(x)];
   if (x < table2[y]) y = y - 1;
   return y;
}


From table1 an approximation to ilog10(x) is obtained. The approximation is usually the correct value, but
it is too high by 1 for x = 0 and for x in the range 8 to 9, 64 to 99, 512 to 999, 8192 to 9999, and so on. The
second table gives the value below which the estimate must be corrected by subtracting 1.


This scheme uses a total of 73 bytes for tables, and can be coded in only six instructions on the IBM
System/370 [CJS] (to achieve this, the values in table1 must be four times the values shown). It executes in
about ten instructions on a RISC that has number of leading zeros but no other esoteric instructions. The other
methods to be discussed are variants of this.


The first variation eliminates the conditional branch that results from the if statement. Actually, the program in 
Figure 11-10 can be coded free of branches if the machine has the set less than unsigned instruction, but the 
method to be described can be used on machines that have no unusual instructions (other than number of 
leading zeros). 


The method is to replace the if statement with a subtraction followed by a shift right of 31 so that the sign bit
can be subtracted from y. A difficulty occurs for large x ($x \ge 2^{31} + 10^9$), which can be fixed by adding an entry
to table2, as shown in Figure 11-11.


Figure 11-11 Integer log base 10 from log base 2, double table lookup, branch-free. 


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {10, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[11] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000,
      0};

   y = table1[nlz(x)];
   y = y - ((x - table2[y]) >> 31);
   return y;
}


This executes in about 11 instructions on a RISC that has number of leading zeros but is otherwise quite
"basic." It can be modified to return the value 0, rather than -1, for x = 0 (which is preferable for the decimal
conversion problem) by changing the last entry in table1 to 1 (that is, by changing "0, 0, 0, 0" to "0, 0, 0, 1").


The next variation replaces the first table lookup with a subtraction, a multiplication, and a shift. This seems
likely to be possible because $\log_{10} x$ and $\log_2 x$ are related by a multiplicative constant, namely $\log_{10} 2$ =
0.30103 .... Thus, it may be possible to compute ilog10(x) by computing $\lfloor c \cdot \mathrm{ilog2}(x) \rfloor$ for some suitable
$c \approx 0.30103$, and correcting the result by using a table such as table2 in Figure 11-11.


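To pursue this, let $\log_{10} 2 = c + \varepsilon$, where c > 0 is a rational approximation to $\log_{10} 2$ that is a convenient
multiplier, and $\varepsilon > 0$. Then for $x \ge 1$,


$$\begin{aligned}
\mathrm{ilog10}(x) = \lfloor \log_{10} x \rfloor &= \lfloor (c + \varepsilon)\log_2 x \rfloor \\
\lfloor c \log_2 x \rfloor \le \mathrm{ilog10}(x) &= \lfloor c \log_2 x + \varepsilon \log_2 x \rfloor \\
\lfloor c\,\mathrm{ilog2}(x) \rfloor \le \mathrm{ilog10}(x) &\le \lfloor c(\mathrm{ilog2}(x) + 1) + \varepsilon \log_2 x \rfloor \\
&\le \lfloor c\,\mathrm{ilog2}(x) + c + \varepsilon \log_2 x \rfloor \\
&\le \lfloor c\,\mathrm{ilog2}(x) \rfloor + \lfloor c + \varepsilon \log_2 x \rfloor + 1.
\end{aligned}$$


Thus, if we choose c so that $c + \varepsilon \log_2 x < 1$, then $\lfloor c\,\mathrm{ilog2}(x) \rfloor$ approximates ilog10(x) with an error of 0 or +1.
Furthermore, if we take ilog2(0) = ilog10(0) = -1, then $\lfloor c\,\mathrm{ilog2}(0) \rfloor$ = ilog10(0) (because $0 < c \le 1$), so we
need not be concerned about this case. (There are other definitions that would work here, such as ilog2(0) =
ilog10(0) = 0).


Because $\varepsilon = \log_{10} 2 - c$, we must choose c so that


$$c + (\log_{10} 2 - c)\log_2 x < 1, \quad\text{or}\quad c(\log_2 x - 1) > (\log_{10} 2)\log_2 x - 1.$$


This is satisfied for x = 1 (because c < 1) and 2. For larger x, we must have


$$c > \frac{(\log_{10} 2)\log_2 x - 1}{\log_2 x - 1}.$$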


The most stringent requirement on c occurs when x is large. For a 32-bit machine, $x < 2^{32}$, so choosing


$$c \ge \frac{0.30103 \cdot 32 - 1}{32 - 1} \approx 0.27848$$


suffices. Because c < 0.30103 (because $\varepsilon > 0$), c = 9/32 = 0.28125 is a convenient value. Experimentation
reveals that coarser values such as 5/16 and 1/4 are not adequate.


This leads to the scheme illustrated in Figure 11-12, which estimates low and then corrects by adding 1. It
executes in about 11 instructions on a RISC that has number of leading zeros, counting the multiply as one
instruction.


Figure 11-12 Integer log base 10 from log base 2, one table lookup. 


static unsigned table2[10] = {0, 9, 99, 999, 9999, 
99999, 999999, 9999999, 99999999, 999999999}; 


y = (9*(31 - nlz(x))) >> 5; 
if (x > table2[y+1]) y = y + 1; 
return y; 


This can be made into a branch-free version, but again there is a difficulty with large x ($x \ge 2^{31} + 10^9$), which
can be fixed in either of two ways. One way is to use a different multiplier (19/64) and a slightly expanded
table. The program is shown in Figure 11-13 (about 11 instructions on a RISC that has number of leading zeros,
counting the multiply as one instruction).


Figure 11-13 Integer log base 10 from log base 2, one table lookup, branch-free. 


int ilog10(unsigned x) {
   int y;
   static unsigned table2[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   y = (19*(31 - nlz(x))) >> 6;
   y = y + ((table2[y+1] - x) >> 31);
   return y;
}


The other "fix" is to or x into the result of the subtraction, to force the sign bit to be on for $x \ge 2^{31}$; that is,
change the second executable line of Figure 11-12 to


y = y + (((table2[y+1] - x) | x) >> 31); 


This is the preferable program if multiplication by 19 is substantially more difficult than multiplication by 9 (as 
it is for a shift-and-add sequence). 
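

For readers who want to try the 19/64 scheme on a machine or compiler without a number of leading zeros instruction, the following self-contained sketch may help. The helper nlz32 is a plain loop written here as a stand-in for the instruction, and both function names are assumptions, not taken from the figures. The sketch also assumes that >> applied to a negative int is an arithmetic shift, which is what common compilers do but is implementation-defined in C; this matters only for x = 0.

```c
#include <stdint.h>

/* Stand-in for the nlz instruction (simple loop; hypothetical helper). */
static int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x = x << 1;
        n = n + 1;
    }
    return n;
}

/* Branch-free ilog10 with multiplier 19/64, after Figure 11-13. */
int ilog10_bf(uint32_t x) {
    static const uint32_t table2[11] = {0, 9, 99, 999, 9999,
        99999, 999999, 9999999, 99999999, 999999999,
        0xFFFFFFFFu};
    int y;

    y = (19*(31 - nlz32(x))) >> 6;               /* Low estimate; -1 for x = 0,
                                                    assuming an arithmetic shift. */
    y = y + (int)((table2[y+1] - x) >> 31);      /* Add 1 if x > table2[y+1]. */
    return y;
}
```

The estimate is either correct or low by 1, and the sign bit of the unsigned difference supplies the correction.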


For a 64-bit machine, choosing


$$c \ge \frac{0.30103 \cdot 64 - 1}{64 - 1} \approx 0.28993$$


suffices. The value 19/64 = 0.296875 is convenient, and experimentation reveals that no coarser value is
adequate. The program is (branch-free version)


   unsigned table2[20] = {0, 9, 99, 999, 9999, ...,
      9999999999999999999};

   y = (19*(63 - nlz(x))) >> 6;
   y = y + ((table2[y+1] - x) >> 63);
   return y;


Chapter 12. Unusual Bases for Number Systems 


This section discusses a few unusual positional number systems. They are just interesting curiosities and are 
probably not practical for anything. We limit the discussion to integers, but they can all be extended to include 
digits after the radix point—which usually, but not always, denotes non-integers. 


12-1 Base -2 


By using -2 as the base, both positive and negative integers can be expressed without an explicit sign or other
irregularity such as having a negative weight for the most significant bit [Knu3]. The digits used are 0 and 1, as
in base +2; that is, the value represented by a string of 1's and 0's is understood to be


$$(a_n \ldots a_3 a_2 a_1 a_0) = a_n(-2)^n + \cdots + a_3(-2)^3 + a_2(-2)^2 + a_1(-2) + a_0.$$


From this, it can be seen that a procedure for finding the base -2, or "negabinary," representation of an integer is 
to successively divide the number by -2, recording the remainders. The division must be such that it always 
gives a remainder of 0 or 1 (the digits to be used); that is, it must be modulus division. As an example, the plan 
below shows how to find the base -2 representation of -3. 


   -3 / -2 =  2  rem 1
    2 / -2 = -1  rem 0
   -1 / -2 =  1  rem 1
    1 / -2 =  0  rem 1


Because we have reached a 0 quotient, the process terminates (if continued, the remaining quotients and 
remainders would all be 0). Thus, reading the remainders upwards, we see that -3 is written 1101 in base -2. 
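

The repeated-division procedure can be sketched in C as follows. C's / and % operators truncate toward zero, so the remainder can come out as -1; the sketch adjusts the quotient and remainder to obtain the modulus division the text calls for. The function name and the choice to return the digits packed into a 64-bit word (bit i holding digit $a_i$) are assumptions made here for illustration.

```c
#include <stdint.h>

/* Base -2 digits of n, found by successive modulus division by -2.
   Bit i of the result is the digit of weight (-2)^i. Hypothetical name. */
uint64_t to_negabinary_div(int32_t n) {
    uint64_t bits = 0;
    int i = 0;

    while (n != 0) {
        int32_t q = n / -2;             /* Truncating division. */
        int32_t r = n % -2;             /* May be -1, 0, or 1. */
        if (r < 0) {                    /* Force the remainder to be 0 or 1. */
            r = r + 2;
            q = q + 1;
        }
        bits = bits | ((uint64_t)r << i);
        i = i + 1;
        n = q;
    }
    return bits;
}
```

For n = -3 this records the remainders 1, 0, 1, 1 (least significant first), giving the bit pattern 1101 as in the plan above.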


Table 12-1 shows, on the left, how each bit pattern from 0000 to 1111 is interpreted in base -2, and on the right, 
how integers in the range -15 to +15 are represented. 


Table 12-1. Conversions between Decimal and Base -2 


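n (base -2)   n (decimal)        n (decimal)   n (base -2)   -n (base -2)

0000            0                 0            0             0
0001            1                 1            1             11
0010           -2                 2            110           10
0011           -1                 3            111           1101
0100            4                 4            100           1100
0101            5                 5            101           1111
0110            2                 6            11010         1110
0111            3                 7            11011         1001
1000           -8                 8            11000         1000
1001           -7                 9            11001         1011
1010          -10                10            11110         1010
1011           -9                11            11111         110101
1100           -4                12            11100         110100
1101           -3                13            11101         110111
1110           -6                14            10010         110110
1111           -5                15            10011         110001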


It is not obvious that the $2^n$ possible bit patterns in an n-bit word uniquely represent all integers in a certain
range, but this can be shown by induction. The inductive hypothesis is that an n-bit word represents all integers
in the range


Equation 1a


$$-(2^{n+1} - 2)/3 \;\text{ to }\; (2^n - 1)/3, \quad\text{for } n \text{ even, and}$$


Equation 1b


$$-(2^n - 2)/3 \;\text{ to }\; (2^{n+1} - 1)/3, \quad\text{for } n \text{ odd.}$$


Assume first that n is even. For n = 2, the representable integers are 10, 11, 00, and 01 in base -2, or


-2, -1, 0, 1.


This agrees with (1a), and each integer in the range is represented once and only once.


A word of n + 1 bits can, with a leading bit of 0, represent all the integers given by (1a). In addition, with a
leading bit of 1, it can represent all these integers biased by $(-2)^n = 2^n$. The new range is


$$2^n - (2^{n+1} - 2)/3 \;\text{ to }\; 2^n + (2^n - 1)/3,$$


or


$$(2^n - 1)/3 + 1 \;\text{ to }\; (2^{n+2} - 1)/3.$$


This is contiguous to the range given by (1a), so for a word size of n + 1 bits, all integers in the range


$$-(2^{n+1} - 2)/3 \;\text{ to }\; (2^{n+2} - 1)/3$$


are represented once and only once. This agrees with (1b), with n replaced by n + 1.


The proof that (1a) follows from (1b), for n odd, and that all integers in the range are uniquely represented, is 
similar. 


To add and subtract, the usual rules, such as 0 + 1 = 1 and 1 - 1 = 0, of course apply. Because 2 is written 110,
and -1 is written 11, and so on, the following additional rules apply. These, together with the obvious ones,
suffice.


    1 + 1     = 110
   11 + 1     = 0
    1 + 1 + 1 = 111
    0 - 1     = 11
   11 - 1     = 10

When adding or subtracting, there are sometimes two carry bits. The carry bits are to be added to their column, 
even when subtracting. It is convenient to place them both over the next bit to the left, and simplify (when 
possible) using 11 + 1 = 0. If 11 is carried to a column that contains two 0's, bring down a 1 and carry a 1. 
Below are examples. 


Addition

     11  1 11    11
         1  0  1  1  1          19
  +   1  1  0  1  0  1       +(-11)
    ------------------
      0  1  1  0  0  0           8


Subtraction

      1 11  1     1
            1  0  1  0  1       21
  -      1  0  1  1  1  0    -(-38)
    ---------------------
      1  0  0  1  1  1  1       59


The only carries possible are 0, 1, and 11. Overflow occurs if there is a carry (either 1 or 11) out of the high- 
order position. These remarks apply to both addition and subtraction. 


Because there are three possibilities for the carry, a base -2 adder would be more complex than a two's- 
complement adder. 


There are two ways to negate an integer. It may be added to itself shifted left one position (that is, multiply by
-1), or it may be subtracted from 0. There is no rule as simple and convenient as the "complement and add 1"
rule of two's-complement arithmetic. In two's-complement, this rule is used to build a subtracter from an adder
(to compute A - B, form A + ¬B + 1). There does not seem to be any such simple device for base -2.


Multiplication of base -2 integers is straightforward. Just use the rule that 1 x 1 = 1 and 0 times either 0 or 1 is 
0, and add the columns using base -2 addition. 


Division, however, is quite complicated. It is a real challenge to devise a reasonable hardware division 
algorithm—that is, one based on repeated subtraction and shifting. Figure 12-1 shows an algorithm that is 
expressed, for definiteness, for an 8-bit machine. It does modulus division (nonnegative remainder). 


Figure 12-1 Division in base -2. 


int divbm2(int n, int d) {      // q = n/d in base -2.
   int r, dw, c, q, i;

   r = n;                       // Init. remainder.
   dw = (-128)*d;               // Position d.
   c = (-43)*d;                 // Init. comparand.
   if (d > 0) c = c + d;
   q = 0;                       // Init. quotient.
   for (i = 7; i >= 0; i--) {
      if (d > 0 ^ (i&1) == 0 ^ r >= c) {
         q = q | (1 << i);      // Set a quotient bit.
         r = r - dw;            // Subtract d shifted.
      }
      dw = dw/(-2);             // Position d.
      if (d > 0) c = c - 2*d;   // Set comparand for
      else c = c + d;           // next iteration.
      c = c/(-2);
   }
   return q;                    // Return quotient in
                                // base -2.
                                // Remainder is r,
}                               // 0 <= r < |d|.


Although this program is written in C and was tested on a binary two's-complement machine, that is immaterial;
it should be viewed somewhat abstractly. The input quantities n and d, and all internal variables except for
q, are simply numbers without any particular representation. The output q is a string of bits to be interpreted in
base -2.


This requires a little explanation. If the input quantities were in base -2, the algorithm would be very awkward
to express in an executable form. For example, the test "if (d > 0)" would have to test that the most
significant bit of d is in an even position. The addition in "c = c + d" would have to be a base -2 addition. The
code would be very hard to read. The way the algorithm is coded, you should think of n and d as numbers
without any particular representation. The code shows the arithmetic operations to be performed, whatever
encoding is used. If the numbers are encoded in base -2, as they would be in hardware that implements this
algorithm, the multiplication by -128 is a left shift of seven positions, and the divisions by -2 are right shifts of
one position.


As examples, the code computes values as follows: 


divbm2(6, 2) = 7      (six divided by two is $111_{-2}$)
divbm2(-4, 3) = 2     (minus four divided by three is $10_{-2}$)
divbm2(-4, -3) = 6    (minus four divided by minus 3 is $110_{-2}$)


The step q = q | (1 << i); represents simply setting bit i of q. The next line, r = r - dw, represents
reducing the remainder by the divisor d shifted left.


The algorithm is difficult to describe in detail, but we will try to give the general idea. 


Consider determining the value of the first bit of the quotient, bit 7 of q. In base -2, 8-bit numbers that have
their most significant bit "on" range in value from -170 to -43. Therefore, ignoring the possibility of overflow,
the first (most significant) quotient bit will be 1 if (and only if) the quotient will be algebraically less than or
equal to -43.


Because n = qd + r and for a positive divisor $r \le d - 1$, for a positive divisor the first quotient bit will be 1 iff
$n \le -43d + (d - 1)$, or $n < -43d + d$. For a negative divisor, the first quotient bit will be 1 iff $n \ge -43d$ ($r \ge 0$ for
modulus division).


Thus, the first quotient bit is 1 iff


$$(d > 0 \mathbin{\&} \neg(n \ge -43d + d)) \mid (d < 0 \mathbin{\&} n \ge -43d).$$


Ignoring the possibility that d = 0, this can be written as


$$d > 0 \oplus n \ge c,$$


where c = -43d + d if d > 0, and c = -43d if d < 0.


This is the logic for determining a quotient bit for an odd-numbered bit position. For an even-numbered
position, the logic is reversed. Hence the test includes the term (i&1) == 0. (The ^ character in the program
denotes exclusive or.)


At each iteration, c is set equal to the smallest (closest to zero) integer that must have a 1-bit at position i after
dividing by d. If the current remainder r exceeds that, then bit i of q is set to 1 and r is adjusted by
subtracting the value of a 1 at that position, multiplied by the divisor d. No real multiplication is required here;
d is simply positioned properly and subtracted.


The algorithm is not elegant. It is awkward to implement because there are several additions, subtractions, and 
comparisons, and there is even a multiplication (by a constant) that must be done at the beginning. One might 
hope for a "uniform" algorithm—one that does not test the signs of the arguments and do different things 
depending on the outcome. Such a uniform algorithm, however, probably does not exist for base -2 (or for 
two's-complement arithmetic). The reason for this is that division is inherently a non-uniform process. Consider 
the simplest algorithm of the shift-and-subtract type. This algorithm would not shift at all, but for positive 
arguments would simply subtract the divisor from the dividend repeatedly, counting the number of subtractions 
performed until the remainder is less than the divisor. But if the dividend is negative (and the divisor is 
positive), the process is to add the divisor repeatedly until the remainder is 0 or positive, and the quotient is the 
negative of the count obtained. The process is still different if the divisor is negative. 


In spite of this, division is a uniform process for the signed-magnitude representation of numbers. With such a 
representation, the magnitudes are positive, so the algorithm can simply subtract magnitudes and count until the 
remainder is negative, and then set the sign bit of the quotient to the exclusive or of the arguments, and the sign 
bit of the remainder equal to the sign of the dividend (this gives ordinary truncating division). 


The algorithm given above could be made more uniform, in a sense, by first complementing the divisor, if it is 
negative, and then performing the steps given as simplified by having d > 0. Then a correction would be 
performed at the end. For modulus division, the correction is to negate the quotient and leave the remainder 
unchanged. This moves some of the tests out of the loop, but the algorithm as a whole is still not pretty. 


It is interesting to contrast the commonly used number representations and base -2 regarding the question of 
whether or not the computer hardware treats numbers uniformly in carrying out the four fundamental arithmetic 
operations. We don't have a precise definition of "uniformly," but basically it means free of operations that 
might or might not be done, depending on the signs of the arguments. We consider setting the sign bit of the 
result equal to the exclusive or of the signs of the arguments to be a uniform operation. Table 12-2 shows which 
operations treat their operands uniformly with various number representations. 


One's-complement addition and subtraction are done uniformly by means of the "end around carry" trick. For 
addition, all bits, including the sign bit, are added in the usual binary way, and the carry out of the leftmost bit 
(the sign bit) is added to the least significant position. This process always terminates right away (that is, the 
addition of the carry cannot generate another carry out of the sign bit position). 


Table 12-2. Uniform Operations in Various Number Encodings


                 Signed-magnitude   One's-complement   Two's-complement   Base -2

addition         no                 yes                yes                yes
subtraction      no                 yes                yes                yes
multiplication   yes                no                 no                 yes
division         yes                no                 no                 no


In the case of two's-complement multiplication, the entry is "yes" if only the right half of the doubleword 
product is desired. 


We conclude this discussion of the base -2 number system with some observations about how to convert 
between straight binary and base -2. 


To convert to binary from base -2, form a word that has only the bits with positive weight, and subtract a word 
that has only the bits with negative weight, using the subtraction rules of binary arithmetic. An alternative 
method that may be a little simpler is to extract the bits appearing in the negative weight positions, shift them 
one position to the left, and subtract the extracted number from the original number using the subtraction rules 
of ordinary binary arithmetic. 


To convert to base -2 from binary, extract the bits appearing in the odd positions (positions weighted by 2^n with 
n odd), shift them one position to the left, and add the two numbers using the addition rules of base -2. Here are 
two examples: 


Binary from base -2                     Base -2 from binary 

       110111   (-13)                          110111   (55) 
    - 1 0 1     (binary subtract)           + 1 0 1     (base -2 add) 
    ---------                               --------- 
 ...111110011   (-13)                         1001011   (55) 


On a computer, with its fixed word size, these conversions work for negative numbers if the carries out of the 
high-order position are simply discarded. To illustrate, the example on the right above can be regarded as 
converting -9 to base -2 from binary if the word size is six bits. 


The above algorithm for converting to base -2 cannot easily be implemented in software on a binary computer, 
because it requires doing addition in base -2. Schroeppel [HAK, item 128] overcomes this with a much more 
clever and useful way to do the conversions in both directions. To convert to binary, his method is 


$$B \leftarrow (N \oplus \mathtt{0b10\ldots1010}) - \mathtt{0b10\ldots1010}.$$


To see why this works, let the base -2 number consist of the four digits abcd. Then, interpreted (erroneously) in 
straight binary, this is 8a + 4b + 2c + d. After the exclusive or, interpreted in binary it is 8(1 - a) + 4b + 2(1 - c) 
+ d. After the (binary) subtraction of 8 + 2, it is - 8a + 4b - 2c + d, which is its value interpreted in base -2. 


Schroeppel's formula can be readily solved for N in terms of B, so it gives a three-instruction method for 
converting in the other direction. Collecting these results, we have the following formulas for converting to 
binary, for a 32-bit machine: 


$$B \leftarrow (N \mathbin{\&} \mathtt{0x55555555}) - (N \mathbin{\&} \lnot\mathtt{0x55555555}),$$
$$B \leftarrow N - ((N \mathbin{\&} \mathtt{0xAAAAAAAA}) \ll 1),$$
$$B \leftarrow (N \oplus \mathtt{0xAAAAAAAA}) - \mathtt{0xAAAAAAAA},$$


and the following, for converting to base -2 from binary: 


$$N \leftarrow (B + \mathtt{0xAAAAAAAA}) \oplus \mathtt{0xAAAAAAAA}.$$
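
As a minimal sketch, the conversions can be written in C as below. The function names are illustrative, not from the text, and the code assumes 32-bit words and the usual two's-complement wraparound when an unsigned value is converted to `int32_t`.

```c
#include <stdint.h>

/* Base -2 word n to its two's-complement binary value (Schroeppel). */
int32_t binary_from_negabinary(uint32_t n) {
   return (int32_t)((n ^ 0xAAAAAAAAu) - 0xAAAAAAAAu);
}

/* Same result by the "extract, shift left, subtract" method. */
int32_t binary_from_negabinary_alt(uint32_t n) {
   return (int32_t)(n - ((n & 0xAAAAAAAAu) << 1));
}

/* Two's-complement binary value b to its base -2 word. */
uint32_t negabinary_from_binary(int32_t b) {
   return ((uint32_t)b + 0xAAAAAAAAu) ^ 0xAAAAAAAAu;
}
```

For the examples worked earlier, the base -2 word 110111 (0x37) converts to -13, and binary 55 converts to the base -2 word 1001011 (0x4B).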


12-2 Base -1 + i 


By using -1 + i as the base, where i is √-1, all complex integers (complex numbers with integral real and 
imaginary parts) can be expressed as a single "number" without an explicit sign or other irregularity. 
Surprisingly, this can be done using only 0 and 1 for digits, and all integers are represented uniquely. We will 
not prove this or much else about this number system, but will just describe it very briefly. 


It is not entirely trivial to discover how to write the integer 2. [1] However, this can be determined 
algorithmically by successively dividing 2 by the base and recording the remainders. What does a "remainder" 
mean in this context? We want the remainder after dividing by -1 + i to be 0 or 1, if possible (so that the digits 
will be 0 or 1). To see that it is always possible, assume that we are to divide an arbitrary complex integer a + 
bi by -1 + i. Then, we wish to find q and r such that q is a complex integer, r = 0 or 1, and 


$$a + bi = (q_r + q_i i)(-1 + i) + r,$$


where q_r and q_i denote the real and imaginary parts of q, respectively. Equating real and imaginary parts and 
solving the two simultaneous equations for q gives 


$$q_r = \frac{b - a + r}{2} \quad\text{and}\quad q_i = \frac{-a - b + r}{2}.$$


[1] The interested reader might warm up to this challenge. 
Clearly, if a and b are both even or are both odd, then by choosing r = 0, q is a complex integer. Furthermore, if 
one of a and b is even and the other is odd, then by choosing r = 1, q is a complex integer. 


Thus, the integer 2 can be converted to base - 1 + i by the plan illustrated below. 


Because the real and imaginary parts of the integer 2 are both even, we simply do the division, knowing that the 
remainder will be 0: 


$$\frac{2}{-1+i} = \frac{2(-1-i)}{(-1+i)(-1-i)} = -1 - i \quad\text{rem } 0.$$


Because the real and imaginary parts of -1 - i are both odd, again we simply divide, knowing that the remainder 
is 0: 


$$\frac{-1-i}{-1+i} = \frac{(-1-i)(-1-i)}{(-1+i)(-1-i)} = i \quad\text{rem } 0.$$


Because the real and imaginary parts of i are even and odd, respectively, the remainder will be 1. It is simplest 
to account for this at the beginning by subtracting 1 from the dividend. 


$$\frac{i - 1}{-1 + i} = 1 \quad\text{(remainder is 1)}.$$


Because the real and imaginary parts of 1 are odd and even, the next remainder will be 1. Subtracting this from 
the dividend gives 


$$\frac{1 - 1}{-1 + i} = 0 \quad\text{(remainder is 1)}.$$


Because we have reached a 0 quotient, the process terminates, and the base - 1 + i representation for 2 is seen to 
be 1100 (reading the remainders upwards). 
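
The repeated division can be sketched in C as below. The function name and interface are hypothetical. The sketch assumes two's-complement `int` and that the result fits in 32 bits.

```c
/* Returns the base -1+i bit pattern of the complex integer a + bi
   (hypothetical helper; assumes the pattern fits in 32 bits). */
unsigned to_base_m1_plus_i(int a, int b) {
   unsigned bits = 0;
   int pos = 0;
   while (a != 0 || b != 0) {
      int r = (a ^ b) & 1;          // Remainder: 1 if parities differ.
      int qr = (b - a + r) / 2;     // Real part of quotient (exact).
      int qi = (-a - b + r) / 2;    // Imaginary part of quotient (exact).
      bits |= (unsigned)r << pos;   // Record remainder as next digit.
      pos++;
      a = qr;
      b = qi;
   }
   return bits;
}
```

Called with (2, 0) it returns binary 1100, matching the hand computation above. Called with (-1, 0) it returns 11101.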


Table 12-3 shows how each bit pattern from 0000 to 1111 is interpreted in base - 1 + i, and how the real 
integers in the range -15 to +15 are represented. 


The addition rules for base - 1 + i (in addition to the trivial ones involving a 0 bit) are as follows: 


1 + 1                         = 1100 
1 + 1 + 1                     = 1101 
1 + 1 + 1 + 1                 = 111010000 
1 + 1 + 1 + 1 + 1             = 111010001 
1 + 1 + 1 + 1 + 1 + 1         = 111011100 
1 + 1 + 1 + 1 + 1 + 1 + 1     = 111011101 
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 111000000 


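Table 12-3. Conversions between Decimal and Base -1 + i 


| n (base -1 + i) | n (decimal) | n (decimal) | n (base -1 + i) | -n (base -1 + i) |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 11101 |
| 10 | -1 + i | 2 | 1100 | 11100 |
| 11 | i | 3 | 1101 | 10001 |
| 100 | -2i | 4 | 111010000 | 10000 |
| 101 | 1 - 2i | 5 | 111010001 | 11001101 |
| 110 | -1 - i | 6 | 111011100 | 11001100 |
| 111 | -i | 7 | 111011101 | 11000001 |
| 1000 | 2 + 2i | 8 | 111000000 | 11000000 |
| 1001 | 3 + 2i | 9 | 111000001 | 11011101 |
| 1010 | 1 + 3i | 10 | 111001100 | 11011100 |
| 1011 | 2 + 3i | 11 | 111001101 | 11010001 |
| 1100 | 2 | 12 | 100010000 | 11010000 |
| 1101 | 3 | 13 | 100010001 | 1110100001101 |
| 1110 | 1 + i | 14 | 100011100 | 1110100001100 |
| 1111 | 2 + i | 15 | 100011101 | 1110100000001 |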


When adding two numbers, the largest number of carries that occurs in one column is six, so the largest sum of 
a column is 8 (111000000). This makes for a rather complicated adder. If one were to build a complex 
arithmetic machine, it would no doubt be best to keep the real and imaginary parts separate, [2] with each 
represented in some sensible way such as two's-complement. 


[2] This is the way it was done at Bell Labs back in 1940 on George Stibitz's Complex Number Calculator [Irvine]. 


12-3 Other Bases 


The base - 1 - i has essentially the same properties as the base - 1 + i discussed above. If a certain bit pattern 
represents the number a + bi in one of these bases, then the same bit pattern represents the number a - bi in the 
other base. 


The bases 1 + i and 1 - i can also represent all the complex integers, using only 0 and 1 for digits. These two 
bases have the same complex-conjugate relationship to each other, as do the bases -1 ± i. In bases 1 ± i, the 
representation of some integers has an infinite string of 1's on the left, similar to the two's-complement 
representation of negative integers. This arises naturally by using uniform rules for addition and subtraction, as 
in the case of two's-complement. One such integer is 2, which (in either base) is written ...11101100. Thus, 
these bases have the rather complex addition rule 1 + 1 = ...11101100. 


By grouping into pairs the bits in the base -2 representation of an integer, one obtains a base 4 representation 
for the positive and negative numbers, using the digits -2, -1, 0, and 1. For example, 


$$-14_{\text{decimal}} = 110110_{-2} = (-1)(1)(-2)_{4} = -1 \cdot 4^2 + 1 \cdot 4^1 - 2 \cdot 4^0.$$


Similarly, by grouping into pairs the bits in the base - 1 + i representation of a complex integer, we obtain a 
base -2i representation for the complex integers using the digits 0, 1, - 1 + i, and i. This is a bit too complicated 
to be interesting. 


The "quater-imaginary" system (Knu2) is similar. It represents the complex integers using 2i as a base, and the 
digits 0, 1, 2, and 3 (with no sign). To represent some integers, namely those with an odd imaginary 
component, it is necessary to use a digit to the right of the radix point. For example, i is written 10.2 in base 2i. 


12-4 What Is the Most Efficient Base? 


Suppose you are building a computer and you are trying to decide what base to use to represent integers. For 
the registers you have available circuits that are 2-state (binary), 3-state, 4-state, and so on. Which should you 
use? 


Let us assume that the cost of a b-state circuit is proportional to b. Thus, a 3-state circuit costs 50% more than a 
binary circuit, a 4-state circuit costs twice as much as a binary circuit, and so on. 

Suppose you want the registers to be able to hold integers from 0 to some maximum M. Encoding integers from 
0 to M in base b requires ⌈log_b(M + 1)⌉ digits (e.g., to represent all integers from 0 to 999,999 in decimal 
requires log_10(1,000,000) = 6 digits). 


One would expect the cost of a register to be equal to the product of the number of digits required times the cost 
to represent each digit: 


$$c = k\,b \log_b(M + 1),$$


where c is the cost of a register and k is a constant of proportionality. For a given M, we wish to find b that 
minimizes the cost. 


The minimum of this function occurs for that value of b that makes dc/db = 0. Thus, we have 


$$\frac{d}{db}\bigl(k\,b \log_b(M+1)\bigr) = \frac{d}{db}\left(k\,b\,\frac{\ln(M+1)}{\ln b}\right) = k \ln(M+1)\,\frac{\ln b - 1}{(\ln b)^2}.$$


This is zero when ln b = 1, or b = e. 


This is not a very satisfactory result. Because e ≈ 2.718, 2 and 3 must be the most efficient integral bases. 
Which is more efficient? The ratio of the cost of a base 2 register to the cost of a base 3 register is 


$$\frac{c(2)}{c(3)} = \frac{2 \log_2(M+1)}{3 \log_3(M+1)} = \frac{2 \ln 3}{3 \ln 2} \approx 1.057.$$


Thus, base 2 is more costly than base 3, but only by a small amount. 


By the same analysis, base 2 is more costly than base e by a factor of about 1.062. 
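
The cost model can be tried out with whole digits, as in the following sketch. It takes k = 1 and counts the digits actually needed rather than using the continuous logarithm. The names `digits_needed` and `register_cost` are illustrative only.

```c
/* Number of base-b digits needed to hold every integer from 0 to M
   (b must be at least 2). */
int digits_needed(unsigned long long M, unsigned b) {
   int d = 1;
   while (M >= b) {      // Strip one digit at a time.
      M /= b;
      d++;
   }
   return d;
}

/* Register cost with k = 1: digits required times states per digit. */
unsigned register_cost(unsigned long long M, unsigned b) {
   return b * (unsigned)digits_needed(M, b);
}
```

For M = 999,999 this gives costs of 40, 39, 40, and 60 for bases 2, 3, 4, and 10, so base 3 comes out slightly ahead of base 2 for this particular M, in line with the analysis above.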


Chapter 13. Gray Code 


Gray Code 
Incrementing a Gray-Coded Integer 
Negabinary Gray Code 


Brief History and Applications 


13-1 Gray Code 


Is it possible to cycle through all 2^n combinations of n bits by changing only one bit at a time? The answer is 
"yes," and this is the defining property of Gray codes. That is, a Gray code is an encoding of the integers such 
that a Gray-coded integer and its successor differ in only one bit position. This concept can be generalized to 
apply to any base, such as decimal, but here we will discuss only binary Gray codes. 


Although there are many binary Gray codes, we will discuss only one: the "reflected binary Gray code." This 
code is what is usually meant in the literature by the unqualified term "Gray code." We will show, usually 
without proof, how to do some basic operations in this representation of integers, and we will show a few 
surprising properties. 


The reflected binary Gray code is constructed as follows. Start with the strings 0 and 1, representing the 
integers 0 and 1: 


0 
1 


Reflect this about a horizontal axis at the bottom of the list, and place a 1 to the left of the new list entries, and a 
0 to the left of the original list entries: 


00 
01 
11 
10 


This is the reflected binary Gray code for n = 2. To get the code for n = 3, reflect this and attach a 0 or 1 as 
before: 


000 
001 
011 
010 
110 
111 
101 
100 


From this construction, it is easy to see by induction on n that (1) each of the 2^n bit combinations appears once 
and only once in the list, (2) only one bit changes in going from one list entry to the next, and (3) only one bit 
changes when cycling around from the last entry to the first. Gray codes having this last property are called 
"cyclic," and the reflected binary Gray code is necessarily cyclic. 


If n > 2, there are non-cyclic codes that take on all 2^n values once and only once. One such code is 000 001 011 
010 110 100 101 111. 
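
The reflection construction described earlier can be sketched in C as follows. The function name is hypothetical, and the caller is assumed to supply an array of at least 2^n entries.

```c
/* Fills code[0 .. 2^n - 1] with the n-bit reflected binary Gray code,
   built by repeated reflection (hypothetical helper). */
void build_reflected_gray(unsigned *code, int n) {
   unsigned len = 1;                  // Current list length.
   code[0] = 0;
   for (int k = 0; k < n; k++) {
      // Append the list in reverse order with a 1 in bit k. The
      // original entries keep a 0 there.
      for (unsigned j = 0; j < len; j++)
         code[2*len - 1 - j] = code[j] | (1u << k);
      len *= 2;
   }
}
```

For n = 3 this produces 000, 001, 011, 010, 110, 111, 101, 100, the list shown above.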


Figure 13-1 shows, for n = 4, the integers encoded in ordinary binary and in Gray code. The formulas show 
how to convert from one representation to the other at the bit-by-bit level (as it would be done in hardware). 


Figure 13-1. 4-bit Gray code and conversion formulas. 


As for the number of Gray codes on n bits, notice that one still has a cyclic binary Gray code after rotating the 
list (starting at any of the 2^n positions and cycling around) or reordering the columns. Any combination of these 
operations results in a distinct code. Therefore, there are at least 2^n · n! cyclic binary Gray codes on n bits. 
There are more than this for n ≥ 3. 


The Gray code and binary representations have the following dual relationships, evident from the formulas 
given in Figure 13-1: 


• Bit i of a Gray-coded integer is the parity of bit i and the bit to the left of i in the corresponding 
binary integer (using 0 if there is no bit to the left of i). 


• Bit i of a binary integer is the parity of all the bits at and to the left of position i in the 
corresponding Gray-coded integer. 


Converting to Gray from binary can be done in only two instructions: 


$$G \leftarrow B \oplus (B \gg 1).$$


The conversion to binary from Gray is harder. One method is given by 


$$B \leftarrow \bigoplus_{i=0}^{n-1} (G \gg i).$$


We have already seen this formula in "Computing the Parity of a Word" on page 74. As mentioned there, this 
formula can be evaluated as illustrated below for n = 32. 


B = G ^ (G >> 1); 
B = B ^ (B >> 2); 
B = B ^ (B >> 4); 
B = B ^ (B >> 8); 
B = B ^ (B >> 16); 


Thus, in general it requires 2⌈log2 n⌉ instructions. 


Because it is so easy to convert from binary to Gray, it is trivial to generate successive Gray-coded integers: 


for (i = 0; i < n; i++) { 
   G = i ^ (i >> 1); 
   output G; 
} 

13-2 Incrementing a Gray-Coded Integer 


The logic for incrementing a 4-bit binary integer abcd can be expressed as follows, using Boolean algebra 
notation: 


$$\begin{aligned}
d' &= \bar{d} \\
c' &= c \oplus d \\
b' &= b \oplus cd \\
a' &= a \oplus bcd
\end{aligned}$$


Thus, one way to build a Gray-coded counter in hardware is to build a binary counter using the above logic, and 
convert the outputs a', b', c', d' to Gray by forming the exclusive or of adjacent bits, as shown under "Gray from 
Binary" in Figure 13-1. 


A way that might be slightly better is described by the following formulas: 


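$$\begin{aligned}
p &= e \oplus f \oplus g \oplus h \\
h' &= h \oplus \bar{p} \\
g' &= g \oplus hp \\
f' &= f \oplus g\bar{h}p \\
e' &= e \oplus f\bar{g}\bar{h}p
\end{aligned}$$


That is, the general case is 


$$G_n' = G_n \oplus (G_{n-1}\,\bar{G}_{n-2} \cdots \bar{G}_0\,p), \quad n \ge 2.$$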


Because the parity p alternates between 0 and 1, a counter circuit might maintain p in a separate 1-bit register 
and simply invert it on each count. 
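
The counter logic can be simulated in C as a sketch. It takes the 4-bit Gray code as efgh with e leftmost, and writes each complemented term out explicitly as an exclusive or with 1. The function name is hypothetical. The sketch does not wrap around from the last code, 1000, to 0000.

```c
/* One count of a 4-bit Gray counter efgh (e is the leftmost bit). */
unsigned gray4_next(unsigned G) {
   unsigned e = (G >> 3) & 1, f = (G >> 2) & 1;
   unsigned g = (G >> 1) & 1, h = G & 1;
   unsigned p = e ^ f ^ g ^ h;                     // Parity of the word.
   unsigned h2 = h ^ (p ^ 1);                      // h' = h xor not p
   unsigned g2 = g ^ (h & p);                      // g' = g xor hp
   unsigned f2 = f ^ (g & (h ^ 1) & p);            // f' = f xor g(not h)p
   unsigned e2 = e ^ (f & (g ^ 1) & (h ^ 1) & p);  // e' = e xor f(not g)(not h)p
   return (e2 << 3) | (f2 << 2) | (g2 << 1) | h2;
}
```

Starting from 0000, repeated calls step through 0001, 0011, 0010, 0110, and so on up to 1000.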


In software, the best way to find the successor G' of a Gray-coded integer G is probably simply to convert G to 
binary, increment the binary word, and convert it back to Gray code. Another way that's interesting and almost 
as good is to determine which bit to flip in G. The pattern goes like this, expressed as a word to be exclusive 
or'd to G: 


1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 


The alert reader will recognize this as a mask that identifies the position of the leftmost bit that changes when 
incrementing the integer 0, 1, 2, 3, ..., corresponding to the positions in the above list. Thus, to increment a 
Gray-coded integer G, the bit position to invert is given by the leftmost bit that changes when 1 is added to the 
binary integer corresponding to G. 


This leads to the following algorithms for incrementing a Gray-coded integer G. They both first convert G to 
binary, which is shown as index (G). 


Figure 13-2 Incrementing a Gray-coded integer. 


B = index(G);            B = index(G); 
B = B + 1;               M = ~B & (B + 1); 
Gp = B ^ (B >> 1);       Gp = G ^ M; 
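
Both methods can be written out as complete C functions, as in this sketch. Here `index_of` stands in for index(G) and is implemented with the shift-and-exclusive-or conversion for 32-bit words. All names are illustrative.

```c
/* Gray to binary conversion, playing the role of index(G). */
static unsigned index_of(unsigned g) {
   unsigned b = g ^ (g >> 1);
   b ^= b >> 2;
   b ^= b >> 4;
   b ^= b >> 8;
   b ^= b >> 16;
   return b;
}

/* Method 1: convert to binary, increment, convert back. */
unsigned gray_inc_convert(unsigned g) {
   unsigned b = index_of(g) + 1;
   return b ^ (b >> 1);
}

/* Method 2: flip the bit at the rightmost 0-bit of the index, which
   is the leftmost bit that changes when the index is incremented. */
unsigned gray_inc_mask(unsigned g) {
   unsigned b = index_of(g);
   unsigned m = ~b & (b + 1);
   return g ^ m;
}
```

Both functions return the same successor. For example, the Gray code 010 (index 3) becomes 110 (index 4).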


A pencil-and-paper method of incrementing a Gray-coded integer is as follows: 


Starting from the right, find the first place at which the parity of bits at and to the left of the position is even. 
Invert the bit at this position. 


Or, equivalently: 
Let p be the parity of the word G. If p is even, invert the rightmost bit. 
If p is odd, invert the bit to the left of the rightmost 1-bit. 


The latter rule is directly expressed in the Boolean equations given above. 
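
The second form of the pencil-and-paper rule translates almost directly into C. The following is a sketch for 32-bit words with a hypothetical function name. It does not handle the wraparound from the last code word, where no bit lies to the left of the rightmost 1-bit.

```c
/* Increments a Gray-coded integer using the parity rule. */
unsigned gray_inc_parity(unsigned g) {
   unsigned p = g;                   // Fold the word to get its parity.
   p ^= p >> 16;
   p ^= p >> 8;
   p ^= p >> 4;
   p ^= p >> 2;
   p ^= p >> 1;
   if ((p & 1) == 0)
      return g ^ 1;                  // Even parity: invert rightmost bit.
   return g ^ ((g & -g) << 1);       // Odd: invert the bit to the left
}                                    // of the rightmost 1-bit.
```

Here `g & -g` isolates the rightmost 1-bit, and shifting it left by one selects the bit to invert.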


13-3 Negabinary Gray Code 


If you write the integers in order in base -2, and convert them using the "shift and exclusive or" that converts to 
Gray from straight binary, you get a Gray code. The 3-bit Gray code has indexes that range over the 3-bit 
base -2 numbers, namely -2 to 5. Similarly, the 4-bit Gray code corresponding to 4-bit base -2 numbers has 
indexes ranging from -10 to 5. It is not a reflected Gray code, but it almost is. The 4-bit Gray code can be 
generated by starting with 0 and 1, reflecting this about a horizontal axis at the top of the list, and then 
reflecting it about a horizontal axis at the bottom of the list, and so on. It is cyclic. 


To convert back to base -2 from this Gray code, the rules are of course the same as they are for converting to 
straight binary from ordinary reflected binary Gray code (because these operations are inverses, no matter what 
the interpretation of the bit strings is). 
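
A short C sketch makes the 4-bit case concrete. It converts each integer from -10 to 5 to base -2 by the add-and-exclusive-or formula of Chapter 12 and then applies the shift and exclusive or. The function names are illustrative only.

```c
/* 4-bit base -2 representation of b, valid for -10 <= b <= 5. */
unsigned negabinary4(int b) {
   return (((unsigned)b + 0xAAAAAAAAu) ^ 0xAAAAAAAAu) & 0xF;
}

/* Negabinary Gray code word whose index is b. */
unsigned negabinary_gray4(int b) {
   unsigned n = negabinary4(b);
   return n ^ (n >> 1);
}
```

Running b from -10 to 5 gives 1111, 1110, 1100, 1101, 1001, 1000, 1010, 1011, 0011, 0010, 0000, 0001, 0101, 0100, 0110, 0111. Adjacent words differ in one bit, and so do the last and first.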


13-4 Brief History and Applications 


Gray codes are named after Frank Gray, a physicist at Bell Telephone Laboratories who in the 1930's invented 
the method we now use for broadcasting color TV in a way that's compatible with the black-and-white 
transmission and reception methods then in existence; that is, when the color signal is received by a black-and- 
white set, the picture appears in shades of gray. 


Martin Gardner [Gard] discusses applications of Gray codes involving the Chinese ring puzzle, the Tower of 
Hanoi puzzle, and Hamiltonian paths through graphs that represent hypercubes. He also shows how to convert 
from the decimal representation of an integer to a decimal Gray code representation. 


Gray codes are used in position sensors. A strip of material is made with conducting and nonconducting areas, 
corresponding to the 1's and 0's of a Gray-coded integer. Each column has a conducting wire brush positioned 
to read it out. If a brush is positioned on the dividing line between two of the quantized positions, so that its 
reading is ambiguous, then it doesn't matter which way the ambiguity is resolved. There can be only one 
ambiguous brush, and interpreting it as a 0 or 1 gives a position adjacent to the dividing line. 


The strip can instead be a series of concentric circular tracks, giving a rotational position sensor. For this 
application, the Gray code must be cyclic. Such a sensor is shown in Figure 13-3, where the four dots represent 
the brushes. 


Figure 13-3. Rotational position sensor. 


Chapter 14. Hilbert's Curve 


In 1890 Giuseppe Peano discovered a planar curve [1] with the rather surprising property that it is 
"space-filling." The curve winds around the unit square and hits every point (x, y) at least once. 


[1] Recall that a curve is a continuous map from a one-dimensional space to an n-dimensional space. 


Peano's curve is based on dividing each side of the unit square into three equal parts, which divides the square 
into nine smaller squares. His curve traverses these nine squares in a certain order. Then, each of the nine small 
squares is similarly divided into nine still smaller squares, and the curve is modified to traverse all these 
squares in a certain order. The curve can be described using fractions expressed in base 3; in fact, that's the way 
Peano first described it. 


In 1891 David Hilbert [Hil] discovered a variation of Peano's curve based on dividing each side of the unit 
square into two equal parts, which divides the square into four smaller squares. Then, each of the four small 
squares is similarly divided into four still smaller squares, and so on. For each stage of this division, Hilbert 
gives a curve that traverses all the squares. Hilbert's curve, sometimes called the "Peano-Hilbert curve," is the 
limit curve of this division process. It can be described using fractions expressed in base 2. 


Figure 14-1 shows the first three steps in the sequence that leads to Hilbert's space-filling curve, as they were 
depicted in his 1891 paper. 


Figure 14-1. First three curves in the sequence defining Hilbert's curve. 


Here, we do things a little differently. We use the term "Hilbert curve" for any of the curves in the sequence 
whose limit is the Hilbert space-filling curve. The "Hilbert curve of order n" means the nth curve in the 
sequence. In Figure 14-1, the curves are of order 1, 2, and 3. We shift the curves down and to the left so that the 
corners of the curves coincide with the intersections of the lines in the boxes above. Finally, we scale the size of 
the order n curve up by a factor of 2^n, so that the coordinates of the corners of the curves are integers. Thus, our 
order n Hilbert curve has corners at integers ranging from 0 to 2^n - 1 in both x and y. We take the positive 
direction along the curve to be from (x, y) = (0, 0) to (2^n - 1, 0). Figure 14-2 shows the "Hilbert curves," in 
our terminology, of orders 1 through 6. 


14-1 A Recursive Algorithm for Generating the Hilbert Curve 


To see how to generate a Hilbert curve, examine the curves in Figure 14-2. The order 1 curve goes up, right, 
and down. The order 2 curve follows this overall pattern. First, it makes a U-shaped curve that goes up, in net 
effect. Second, it takes a unit step up. Third, it takes a U-shaped curve, a step, and another U, all to the right. 
Finally, it takes a step down, followed by a U that goes down, in net effect. 


Figure 14-2. Hilbert curves of orders 1-6. 


The order 1 inverted U is converted into the order 2 Y-shaped curve. 


We can regard the Hilbert curve of any order as a series of U-shaped curves of various orientations, each of 
which, except for the last, is followed by a unit step in a certain direction. In transforming a Hilbert curve of 
one order to the next, each U-shaped curve is transformed into a Y-shaped curve with the same general 
orientation, and each unit step is transformed to a unit step in the same direction. 


The transformation of the order 1 Hilbert curve (a U curve with a net direction to the right and a clockwise 
rotational orientation) to the order 2 Hilbert curve goes as follows: 


1. Draw a U that goes up and has a counterclockwise rotation. 

2. Draw a step up. 

3. Draw a U that goes to the right and has a clockwise rotation. 
4. Draw a step to the right. 

5. Draw a U that goes to the right and has a clockwise rotation. 
6. Draw a step down. 

7. Draw a U that goes down and has a counterclockwise rotation. 


We can see by inspection that all U's that are oriented as the order 1 Hilbert curve are transformed in the same 
way. A similar set of rules can be made for transforming U's with other orientations. These rules are embodied 
in the recursive program shown in Figure 14-3 [Voor]. In this program, the orientation of a U curve is 
characterized by two integers that specify the net linear and the rotational directions, encoded as follows: 


dir = 0: right 

dir = 1: up 

dir = 2: left 

dir = 3: down 

rot = +1: clockwise 

rot = -1: counterclockwise 

Figure 14-3 Hilbert curve generator. 

void step(int); 

void hilbert(int dir, int rot, int order) { 

   if (order == 0) return; 

   dir = dir + rot; 
   hilbert(dir, -rot, order - 1); 
   step(dir); 
   dir = dir - rot; 
   hilbert(dir, rot, order - 1); 
   step(dir); 
   hilbert(dir, rot, order - 1); 
   dir = dir - rot; 
   step(dir); 
   hilbert(dir, -rot, order - 1); 
} 


Actually, dir can take on other values, but its congruency modulo 4 is what matters. 


Figure 14-4 shows a driver program and function step that is used by program hilbert. This program is 
given the order of a Hilbert curve to construct, and it displays a list of line segments, giving for each the 
direction of movement, the length along the curve to the end of the segment, and the coordinates of the end of 
the segment. For example, for order 2 it displays 


    0   0000   00 00 
    0   0001   01 00 
    1   0010   01 01 
    2   0011   00 01 
    1   0100   00 10 
    1   0101   00 11 
    0   0110   01 11 
   -1   0111   01 10 
    0   1000   10 10 
    1   1001   10 11 
    0   1010   11 11 
   -1   1011   11 10 
   -1   1100   11 01 
   -2   1101   10 01 
   -1   1110   10 00 
    0   1111   11 00 


Figure 14-4 Driver program for Hilbert curve generator. 


#include <stdio.h> 
#include <stdlib.h> 

int x = -1, y = 0;            // Global variables. 
int i = 0;                    // Dist. along curve. 
int blen;                     // Length to print. 

void hilbert(int dir, int rot, int order); 

void binary(unsigned k, int len, char *s) { 
/* Converts the unsigned integer k to binary character 
form. Result is string s of length len. */ 

   int i; 

   s[len] = 0; 
   for (i = len - 1; i >= 0; i--) { 
      if (k & 1) s[i] = '1'; 
      else       s[i] = '0'; 
      k = k >> 1; 
   } 
} 

void step(int dir) { 
   char ii[33], xx[17], yy[17]; 

   switch(dir & 3) { 
      case 0: x = x + 1; break; 
      case 1: y = y + 1; break; 
      case 2: x = x - 1; break; 
      case 3: y = y - 1; break; 
   } 
   binary(i, 2*blen, ii); 
   binary(x, blen, xx); 
   binary(y, blen, yy); 
   printf("%5d   %s   %s %s\n", dir, ii, xx, yy); 
   i = i + 1;                 // Increment distance. 
} 

int main(int argc, char *argv[]) { 
   int order; 

   order = atoi(argv[1]); 
   blen = order; 
   step(0);                   // Print init. point. 
   hilbert(0, 1, order); 
   return 0; 
} 


14-2 Coordinates from Distance along the Hilbert Curve 


To find the (x, y) coordinates of a point located at a distance s along the order n Hilbert curve, observe that the 
most significant two bits of the 2n-bit integer s determine which major quadrant the point is in. This is because 
the Hilbert curve of any order follows the overall pattern of the order 1 curve. If the most significant two bits of 
s are 00, the point is somewhere in the lower left quadrant, if 01 it is in the upper left quadrant, if 10 it is in the 
upper right quadrant, and if 11 it is in the lower right quadrant. Thus, the most significant two bits of s 
determine the most significant bits of the n-bit integers x and y, as follows: 


Most significant two bits of s    Most significant bits of (x, y) 
             00                              (0, 0) 
             01                              (0, 1) 
             10                              (1, 1) 
             11                              (1, 0) 


In any Hilbert curve, only four of the eight possible U-shapes occur. These are shown in Table 14-1 as graphics 
and as maps from two bits of s to a single bit of each of x and y. 


Observe from Figure 14-2 that in all cases, the U-shape represented by map A becomes, at the next 
level of detail, a U-shape represented by maps B, A, A, or D, depending on whether the length traversed in the 
first-mentioned map A is 0, 1, 2, or 3, respectively. Similarly, a U-shape represented by map B becomes, at 
the next level of detail, a U-shape represented by maps A, B, B, or C, depending on 
whether the length traversed in the first-mentioned map B is 0, 1, 2, or 3, respectively. 


These observations lead to the state transition table shown in Table 14-2, in which the states correspond to the 
mappings shown in Table 14-1. 


Table 14-1. The four possible mappings 


      A               B               C               D 
00 → (0, 0)     00 → (0, 0)     00 → (1, 1)     00 → (1, 1) 
01 → (0, 1)     01 → (1, 0)     01 → (1, 0)     01 → (0, 1) 
10 → (1, 1)     10 → (1, 1)     10 → (0, 0)     10 → (0, 0) 
11 → (1, 0)     11 → (0, 1)     11 → (0, 1)     11 → (1, 0) 


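Table 14-2. State transition table for computing (x, y) from s 


| If the current state is | and the next (to right) two bits of s are | then append to (x, y) | and enter state |
| A | 00 | (0, 0) | B |
| A | 01 | (0, 1) | A |
| A | 10 | (1, 1) | A |
| A | 11 | (1, 0) | D |
| B | 00 | (0, 0) | A |
| B | 01 | (1, 0) | B |
| B | 10 | (1, 1) | B |
| B | 11 | (0, 1) | C |
| C | 00 | (1, 1) | D |
| C | 01 | (1, 0) | C |
| C | 10 | (0, 0) | C |
| C | 11 | (0, 1) | B |
| D | 00 | (1, 1) | C |
| D | 01 | (0, 1) | D |
| D | 10 | (0, 0) | D |
| D | 11 | (1, 0) | A |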


To use the table, start in state A. The integer s should be padded with leading zeros so that its length is 2n, 
where n is the order of the Hilbert curve. Scan the bits of s in pairs from left to right. The first row of 
Table 14-2 means that if the current state is A and the currently scanned bits of s are 00, then output (0, 0) 
and enter state B. Then, advance to the next two bits of s. Similarly, the second row means that if the current 
state is A and the scanned bits are 01, then output (0, 1) and stay in state A. 


The output bits are accumulated in left-to-right order. When the end of s is reached, the n-bit output quantities x 
and y are defined. 


As an example, suppose n = 3 and 


s= 110100. 


Because the process starts in state A and the initial bits scanned are 11, the process outputs (1, 0) and enters 
state D (fourth row). Then, in state D and scanning 01, the process outputs (0, 1) and stays in state D. Lastly, 
the process outputs (1, 1) and enters state C, although the state is now immaterial. 


Thus, the output is (101, 011): that is, x = 5 and y = 3. 


A C program implementing these steps is shown in Figure 14-5. In this program, the current state is represented 
by an integer from 0 to 3 for states A through D, respectively. In the assignment to variable row, the current 
state is concatenated with the next two bits of s, giving an integer from 0 to 15, which is the applicable row 
number in Table 14-2. Variable row is used to access integers (expressed in hexadecimal) that are used as bit 
strings to represent the rightmost two columns of Table 14-2; that is, these accesses are in-register table 
lookups. Left-to-right in the hexadecimal values corresponds to bottom-to-top in Table 14-2. 


Figure 14-5 Program for computing (x, y) from s. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 

   int i; 
   unsigned state, x, y, row; 

   state = 0;                            // Initialize. 
   x = y = 0; 

   for (i = 2*n - 2; i >= 0; i -= 2) {   // Do n times. 
      row = 4*state | (s >> i) & 3;      // Row in table. 
      x = (x << 1) | (0x936C >> row) & 1; 
      y = (y << 1) | (0x39C6 >> row) & 1; 
      state = (0x3E6B94C1 >> 2*row) & 3; // New state. 
   } 
   *xp = x;                              // Pass back 
   *yp = y;                              // results. 
} 
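
For comparison, here is a sketch of the same state machine with the table written out as explicit arrays rather than packed into hexadecimal constants. The array and function names are hypothetical. The entries were read off the constants 0x936C, 0x39C6, and 0x3E6B94C1 used in Figure 14-5, with row = 4*state + (two bits of s) and states A through D numbered 0 through 3.

```c
/* x bit, y bit, and next state for each of the 16 table rows. */
static const unsigned char XBIT[16] = {0,0,1,1, 0,1,1,0, 1,1,0,0, 1,0,0,1};
static const unsigned char YBIT[16] = {0,1,1,0, 0,0,1,1, 1,0,0,1, 1,1,0,0};
static const unsigned char NEXT[16] = {1,0,0,3, 0,1,1,2, 3,2,2,1, 2,3,3,0};

void hil_xy_from_s_table(unsigned s, int n, unsigned *xp, unsigned *yp) {
   unsigned state = 0, x = 0, y = 0;          // Start in state A.
   for (int i = 2*n - 2; i >= 0; i -= 2) {    // Scan s left to right.
      unsigned row = 4*state | ((s >> i) & 3);
      x = (x << 1) | XBIT[row];               // Append output bits.
      y = (y << 1) | YBIT[row];
      state = NEXT[row];
   }
   *xp = x;
   *yp = y;
}
```

With n = 3 and s = 110100 (binary), this returns x = 5 and y = 3, as in the worked example.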


[L&S] give a quite different algorithm. Unlike the algorithm of Figure 14-5, it scans the bits of s from right to 
left. It is based on the observation that one can map the least significant two bits of s to (x, y) based on the order 
1 Hilbert curve, and then test the next two bits of s to the left. If they are 00, the values of x and y just computed 
should be interchanged, which corresponds to reflecting the order 1 Hilbert curve about the line x = y. (Refer to 
the curves of orders 1 and 2 shown in Figure 14-1 on page 241.) If these two bits are 01 or 10, the values of x 
and y are not changed. If they are 11, the values of x and y are interchanged and complemented. These same 
rules apply as one progresses leftward along the bits of s. They are embodied in Table 14-3 and the code of 
Figure 14-6. It is somewhat curious that the bits can be prepended to x and y first, and then the swap and 
complement operations can be done, including these newly prepended bits; the results are the same. 


Figure 14-6 Lam and Shapiro method for computing (x, y) from s. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 

   int i, sa, sb; 
   unsigned x, y, temp; 

   for (i = 0; i < 2*n; i += 2) { 

      sa = (s >> (i+1)) & 1;      // Get bit i+1 of s. 
      sb = (s >> i) & 1;          // Get bit i of s. 

      if ((sa ^ sb) == 0) {       // If sa,sb = 00 or 11, 
         temp = x;                // swap x and y, 
         x = y^(-sa);             // and if sa = 1, 
         y = temp^(-sa);          // complement them. 
      } 
      x = (x >> 1) | (sa << 31);  // Prepend sa to x and 
      y = (y >> 1) | ((sa ^ sb) << 31); // (sa^sb) to y. 
   } 
   *xp = x >> (32 - n);           // Right-adjust x and y 
   *yp = y >> (32 - n);           // and return them to 
}                                 // the caller. 


In Figure 14-6, variables x and y are uninitialized, which might cause an error message from some compilers. 
But the code functions correctly for whatever values x and y have initially. 


The branch in the loop of Figure 14-6 can be avoided by doing the swap operation with the "three exclusive or" 
trick given in Section 2-19 on page 38. 


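Table 14-3. Lam and Shapiro method for computing (x, y) from s 


| If the next (to left) two bits of s are | then | and prepend to (x, y) |
| 00 | Swap x and y | (0, 0) |
| 01 | No change | (0, 1) |
| 10 | No change | (1, 1) |
| 11 | Swap and complement x and y | (1, 0) |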


The if block can be replaced by the following code, where swap and cmpl are unsigned integers: 


swap = (sa ^ sb) - 1;   // -1 if should swap, else 0. 
cmpl = -(sa & sb);      // -1 if should compl't, else 0. 
x = x ^ y; 
y = y ^ (x & swap) ^ cmpl; 
x = x ^ y; 


However, this is nine instructions, versus about two or six for the if block, so the branch cost would have to be 
quite high for this to be a good choice. 
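
For completeness, a branch-free variant of the whole routine might look like the sketch below. The function name is hypothetical. Unlike Figure 14-6, it declares sa and sb as unsigned and initializes x and y to zero, and it assumes 32-bit words with 1 ≤ n ≤ 16.

```c
/* Branch-free Lam and Shapiro style conversion (a sketch). */
void hil_xy_from_s_nobranch(unsigned s, int n, unsigned *xp, unsigned *yp) {
   unsigned x = 0, y = 0;
   for (int i = 0; i < 2*n; i += 2) {
      unsigned sa = (s >> (i+1)) & 1;      // Bit i+1 of s.
      unsigned sb = (s >> i) & 1;          // Bit i of s.
      unsigned swap = (sa ^ sb) - 1;       // All 1's if should swap, else 0.
      unsigned cmpl = -(sa & sb);          // All 1's if should complement.
      x = x ^ y;                           // Conditional swap and
      y = y ^ (x & swap) ^ cmpl;           // complement by the "three
      x = x ^ y;                           // exclusive or" trick.
      x = (x >> 1) | (sa << 31);           // Prepend sa to x and
      y = (y >> 1) | ((sa ^ sb) << 31);    // (sa ^ sb) to y.
   }
   *xp = x >> (32 - n);                    // Right-adjust the results.
   *yp = y >> (32 - n);
}
```

For n = 3 and s = 110100 (binary) it yields x = 5 and y = 3, the same result as the table-driven method.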


The "swap and complement" idea of [L&S] suggests a logic circuit for generating the Hilbert curve. The idea 
behind the circuit described below is that as you trace along the path of an order n curve, you basically map 
pairs of bits of s to (x, y) according to map A of Table 14-1. As the trace enters various regions, however, the 
mapping output gets swapped, complemented, or both. The circuit of Figure 14-7 keeps track of the swap and 
complement requirements of each stage, uses the appropriate mapping to map two bits of s to (x_i, y_i), and 
generates the swap and complement signals for the next stage. 


Figure 14-7. Logic circuit for computing (x, y) from s.

Assume there is a register containing the path length s and circuits for incrementing it. Then, to find the next
point on the Hilbert curve, first increment s and then transform it as described in Table 14-4. This is a
left-to-right process, which is a bit of a problem because incrementing s is a right-to-left process. Thus, the
time to generate a new point on an order n Hilbert curve is proportional to 2n (for incrementing s) plus n (for
transforming s to (x, y)).


Figure 14-7 shows this computation as a logic circuit. In this figure, S denotes the swap signal and C denotes 
the complement signal. 


Table 14-4. Logic for computing (x, y) from s


| If the next (to right) two bits of s are | then append to (x, y) | and set
00     (0, 0)[*]     swap = ~swap
01     (0, 1)[*]     No change
10     (1, 1)[*]     No change
11     (1, 0)[*]     swap = ~swap, cmpl = ~cmpl


[*] Possibly swapped and/or complemented


The logic circuit of Figure 14-7 suggests another way to compute (x, y) from s. Notice how the swap and 
complement signals propagate from left to right through the n stages. This suggests that it may be possible to 
use the parallel prefix operation to quickly (in log2 n steps rather than n - 1) propagate the swap and
complement information to each stage, and then do some word-parallel logical operations to compute x and y, 
using the equations in Figure 14-7. The values of x and y are intermingled in the even and odd bit positions of a 
word, so they have to be separated by the unshuffle operation (see page 107). This might seem a bit 
complicated, and likely to pay off only for rather large values of n, but let us see how it goes. 


A procedure for this operation is shown in Figure 14-8 [GLS1]. The procedure operates on fullword quantities,
so it first pads the input s on the left with '01' bits. This bit combination does not affect the swap and
complement quantities. Next, a quantity cs (complement-swap) is computed. This word is of the form
CSCS...CS, where each C (a single bit), if 1, means that the corresponding pair of bits is to be
complemented, and each S means that the corresponding pair of bits is to be swapped, following Table 14-3. In
other words, these two statements map each pair of bits of s as follows:


s_(2i+1) s_(2i)     cs
00                  01
01                  00
10                  00
11                  11


Figure 14-8 Parallel prefix method for computing (x, y) from s. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp,
                                      unsigned *yp) {

   unsigned comp, swap, cs, t, sr;

   s = s | (0x55555555 << 2*n); // Pad s on left with 01
   sr = (s >> 1) & 0x55555555;  // (no change) groups.
   cs = ((s & 0x55555555) + sr) // Compute complement &
        ^ 0x55555555;           // swap info in two-bit
                                // groups.

   // Parallel prefix xor op to propagate both complement
   // and swap info together from left to right (there is
   // no step "cs ^= cs >> 1", so in effect it computes
   // two independent parallel prefix operations on two
   // interleaved sets of sixteen bits).

   cs = cs ^ (cs >> 2);
   cs = cs ^ (cs >> 4);
   cs = cs ^ (cs >> 8);
   cs = cs ^ (cs >> 16);
   swap = cs & 0x55555555;      // Separate the swap and
   comp = (cs >> 1) & 0x55555555;  // complement bits.

   t = (s & swap) ^ comp;       // Calculate x and y in
   s = s ^ sr ^ t ^ (t << 1);   // the odd & even bit
                                // positions, resp.
   s = s & ((1 << 2*n) - 1);    // Clear out any junk
                                // on the left (unpad).

   // Now "unshuffle" to separate the x and y bits.

   t = (s ^ (s >> 1)) & 0x22222222; s = s ^ t ^ (t << 1);
   t = (s ^ (s >> 2)) & 0x0C0C0C0C; s = s ^ t ^ (t << 2);
   t = (s ^ (s >> 4)) & 0x00F000F0; s = s ^ t ^ (t << 4);
   t = (s ^ (s >> 8)) & 0x0000FF00; s = s ^ t ^ (t << 8);

   *xp = s >> 16;               // Assign the two halves
   *yp = s & 0xFFFF;            // of t to x and y.
}


This is the quantity to which we want to apply the parallel prefix operation. PP-XOR is the one to use, going 
from left to right, because successive 1-bits meaning to complement or to swap have the same logical properties 
as exclusive or: Two successive 1-bits cancel each other. 
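
The cancelling behavior is easy to see on small inputs. The sketch below is illustrative only (the function names are hypothetical): it shows the ordinary left-to-right PP-XOR on a 32-bit word, and the variant without the shift by 1, which runs two independent prefix operations on the even and the odd bit positions.

```c
#include <stdint.h>

/* Left-to-right parallel prefix XOR on a 32-bit word (a sketch).
   Afterward, bit i holds the XOR of input bits i through 31. */
uint32_t pp_xor(uint32_t x) {
   x = x ^ (x >> 1);
   x = x ^ (x >> 2);
   x = x ^ (x >> 4);
   x = x ^ (x >> 8);
   x = x ^ (x >> 16);
   return x;
}

/* The same operation with the shift by 1 omitted. The even-numbered
   and odd-numbered bits are then prefixed separately, 16 bits each. */
uint32_t pp_xor_interleaved(uint32_t x) {
   x = x ^ (x >> 2);
   x = x ^ (x >> 4);
   x = x ^ (x >> 8);
   x = x ^ (x >> 16);
   return x;
}
```

A single 1-bit at the left end turns on every bit to its right, and a second 1-bit immediately after it turns them off again, which is the "two successive 1-bits cancel" property.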


Both signals (complement and swap) are propagated in the same PP-XOR operation, each working with every
other bit of cs.


The next four assignment statements have the effect of translating each pair of bits of s into (x, y) values, with
x being in the odd (leftmost) bit positions, and y being in the even bit positions. Although the logic may seem
obscure, it is not difficult to verify that each pair of bits of s is transformed by the logic of the first two
Boolean equations in Figure 14-7. (Suggestion: Consider separately how the even and odd bit positions are
transformed, using the fact that t and sr are 0 in the odd positions.)


The rest of the procedure is self-explanatory. It executes in 66 basic RISC instructions (constant, branch-free),
versus about 19n + 10 (average) for the code of Figure 14-6 (based on compiled code; includes prologs and
epilogs, which are essentially nil). Thus, the parallel prefix method is faster for n ≥ 3.
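
One way to gain confidence in the parallel prefix code is to compare it exhaustively with the loop method for small orders. The following is a sketch under stated assumptions: the function names are hypothetical, both routines are transcribed with unsigned constants and initialized variables, and n is restricted to 1 through 15 so that no shift amount reaches 32.

```c
#include <stdint.h>

/* Loop method in the manner of Figure 14-6, with x and y initialized. */
static void xy_loop(uint32_t s, int n, uint32_t *xp, uint32_t *yp) {
   uint32_t x = 0, y = 0, sa, sb, temp;
   int i;
   for (i = 0; i < 2*n; i += 2) {
      sa = (s >> (i + 1)) & 1;
      sb = (s >> i) & 1;
      if ((sa ^ sb) == 0) {          /* 00 or 11: swap, and */
         temp = x;                   /* complement if sa = 1. */
         x = y ^ (0u - sa);
         y = temp ^ (0u - sa);
      }
      x = (x >> 1) | (sa << 31);
      y = (y >> 1) | ((sa ^ sb) << 31);
   }
   *xp = x >> (32 - n);
   *yp = y >> (32 - n);
}

/* Parallel prefix method in the manner of Figure 14-8. */
static void xy_prefix(uint32_t s, int n, uint32_t *xp, uint32_t *yp) {
   uint32_t comp, swap, cs, t, sr;
   s = s | (0x55555555u << 2*n);              /* Pad with 01 groups. */
   sr = (s >> 1) & 0x55555555u;
   cs = ((s & 0x55555555u) + sr) ^ 0x55555555u;
   cs = cs ^ (cs >> 2);                       /* Interleaved PP-XOR. */
   cs = cs ^ (cs >> 4);
   cs = cs ^ (cs >> 8);
   cs = cs ^ (cs >> 16);
   swap = cs & 0x55555555u;
   comp = (cs >> 1) & 0x55555555u;
   t = (s & swap) ^ comp;
   s = s ^ sr ^ t ^ (t << 1);
   s = s & ((1u << 2*n) - 1);                 /* Unpad. */
   t = (s ^ (s >> 1)) & 0x22222222u; s = s ^ t ^ (t << 1);
   t = (s ^ (s >> 2)) & 0x0C0C0C0Cu; s = s ^ t ^ (t << 2);
   t = (s ^ (s >> 4)) & 0x00F000F0u; s = s ^ t ^ (t << 4);
   t = (s ^ (s >> 8)) & 0x0000FF00u; s = s ^ t ^ (t << 8);
   *xp = s >> 16;
   *yp = s & 0xFFFF;
}

/* Returns 1 if the two methods agree for every s on the order n
   curve, else 0. Intended for 1 <= n <= 15. */
int hil_methods_agree(int n) {
   uint32_t s, x1, y1, x2, y2;
   for (s = 0; s < (1u << 2*n); s++) {
      xy_loop(s, n, &x1, &y1);
      xy_prefix(s, n, &x2, &y2);
      if (x1 != x2 || y1 != y2) return 0;
   }
   return 1;
}
```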


14-3 Distance from Coordinates on the Hilbert Curve 


Given the coordinates of a point on the Hilbert curve, the distance from the origin to the point can be calculated 
by means of a state transition table similar to Table 14-2. Table 14-5 is such a table. 


Its interpretation is similar to that of the previous section. First, x and y should be padded with leading zeros so 
that they are of length n bits, where n is the order of the Hilbert curve. Second, the bits of x and y are scanned 
from left to right, and s is built up from left to right. 

A C program implementing these steps is shown in Figure 14-9. 


Figure 14-9 Program for computing s from (x, y). 

unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {

   int i;
   unsigned state, s, row;

   state = 0;                            // Initialize.
   s = 0;

   for (i = n - 1; i >= 0; i--) {
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1;
      s = (s << 2) | (0x361E9CB4 >> 2*row) & 3;
      state = (0x8FE65831 >> 2*row) & 3;
   }
   return s;
}
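
The two hexadecimal constants in Figure 14-9 are Table 14-5 packed two bits per row, with the row number formed from the state and the scanned bits. A small sketch for unpacking a single row follows; the function name and parameter names are hypothetical, and states A, B, C, D are assumed to be numbered 0 through 3 as the code implies.

```c
/* Unpack one row of the state transition table that Figure 14-9
   packs into two 32-bit constants (a sketch).
   state: 0..3 for states A..D; xi, yi: the scanned bits of x and y. */
void hil_table_row(unsigned state, unsigned xi, unsigned yi,
                   unsigned *append, unsigned *next_state) {
   unsigned row = 4*state | 2*xi | yi;         /* Row index, 0 to 15.   */
   *append = (0x361E9CB4u >> 2*row) & 3;       /* Two bits to append.   */
   *next_state = (0x8FE65831u >> 2*row) & 3;   /* 0..3 stand for A..D.  */
}
```

For example, state A with bits (1, 0) yields the pair 11 to append and next state D, and state D with bits (1, 1) yields 00 and next state C.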


[L&S] give an algorithm for computing s from (x, y) that is similar to their algorithm for going in the other 
direction (Table 14-3). It is a left-to-right algorithm, shown in Table 14-6 and Figure 14-10. 


Figure 14-10 Lam and Shapiro method for computing s from (x, y). 
unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {

   int i, xi, yi;
   unsigned s, temp;

   s = 0;                         // Initialize.
   for (i = n - 1; i >= 0; i--) {
      xi = (x >> i) & 1;          // Get bit i of x.
      yi = (y >> i) & 1;          // Get bit i of y.

      if (yi == 0) {
         temp = x;                // Swap x and y and,
         x = y^(-xi);             // if xi = 1,
         y = temp^(-xi);          // complement them.
      }
      s = 4*s + 2*xi + (xi^yi);   // Append two bits to s.
   }
   return s;
}


Table 14-5. State transition table for computing s from (x, y)


| If the current state is | and the next (to right) two bits of (x, y) are | then append to s | and enter state
A     (0, 0)     00     B
A     (0, 1)     01     A
A     (1, 0)     11     D
A     (1, 1)     10     A
B     (0, 0)     00     A
B     (0, 1)     11     C
B     (1, 0)     01     B
B     (1, 1)     10     B
C     (0, 0)     10     C
C     (0, 1)     11     B
C     (1, 0)     01     C
C     (1, 1)     00     D
D     (0, 0)     10     D
D     (0, 1)     01     D
D     (1, 0)     11     A
D     (1, 1)     00     C


Table 14-6. Lam and Shapiro method for computing s from (x, y)


| If the next (to right) two bits of (x, y) are | then | and append to s
(0, 0)     Swap x and y                    00
(0, 1)     No change                       01
(1, 0)     Swap and complement x and y     11
(1, 1)     No change                       10


14-4 Incrementing the Coordinates on the Hilbert Curve 


Given the (x, y) coordinates of a point on the order n Hilbert curve, how can one find the coordinates of the next 
point? One way is to convert (x, y) to s, add 1 to s, and then convert the new value of s back to (x, y), using 
algorithms given above. 
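
The convert, add 1, convert back approach can be sketched as follows. This is an illustration rather than a listing from the text: the function names are hypothetical, the two conversions follow the Lam and Shapiro loops, and the caller is assumed to pass a point other than the last one on the curve, with 1 <= n <= 16.

```c
#include <stdint.h>

/* (x, y) to s, in the manner of the Lam and Shapiro loop. */
uint32_t s_from_xy(uint32_t x, uint32_t y, int n) {
   uint32_t s = 0, xi, yi, temp;
   int i;
   for (i = n - 1; i >= 0; i--) {
      xi = (x >> i) & 1;
      yi = (y >> i) & 1;
      if (yi == 0) {                 /* Swap, and complement */
         temp = x;                   /* if xi = 1.           */
         x = y ^ (0u - xi);
         y = temp ^ (0u - xi);
      }
      s = 4*s + 2*xi + (xi ^ yi);    /* Append two bits to s. */
   }
   return s;
}

/* s to (x, y), in the manner of the Lam and Shapiro loop. */
void xy_from_s(uint32_t s, int n, uint32_t *xp, uint32_t *yp) {
   uint32_t x = 0, y = 0, sa, sb, temp;
   int i;
   for (i = 0; i < 2*n; i += 2) {
      sa = (s >> (i + 1)) & 1;
      sb = (s >> i) & 1;
      if ((sa ^ sb) == 0) {
         temp = x;
         x = y ^ (0u - sa);
         y = temp ^ (0u - sa);
      }
      x = (x >> 1) | (sa << 31);
      y = (y >> 1) | ((sa ^ sb) << 31);
   }
   *xp = x >> (32 - n);
   *yp = y >> (32 - n);
}

/* Advance (*xp, *yp) one step along the order n curve by way of s. */
void hil_next_via_s(uint32_t *xp, uint32_t *yp, int n) {
   uint32_t s = s_from_xy(*xp, *yp, n);
   xy_from_s(s + 1, n, xp, yp);
}
```

Starting from (0, 0) on the order 2 curve, successive calls visit (1, 0), (1, 1), (0, 1), (0, 2), each a unit step from the one before.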


A slightly (but not dramatically) better way is based on the fact that as one moves along the Hilbert curve, at 
each step either x or y, but not both, is either incremented or decremented (by 1). The algorithm to be described 
scans the coordinate numbers from left to right to determine the type of U-curve that the rightmost two bits are 
on. Then, based on the U-curve and the value of the rightmost two bits, it increments or decrements either x or y. 


That's basically it, but there is a complication when the path is at the end of a U-curve (which happens once 
every four steps). At this point, the direction to take is determined by the previous bits of x and y and by the 
higher order U-curve with which these bits are associated. If that point is also at the end of its U-curve, then the 
previous bits and the U-curve there determine the direction to take, and so on. 


Table 14-7 describes this algorithm. In this table, the A, B, C, and D denote the U-curves as shown in
Table 14-1 on page 246. To use the table, first pad x and y with leading zeros so they are n bits long, where n
is the order of the Hilbert curve. Start in state A and scan the bits of x and y from left to right. The first row
of Table 14-7 means that if the current state is A and the currently scanned bits are (0, 0), then set a variable
to indicate to increment y, and enter state B. The other rows are interpreted similarly, with a suffix minus sign
indicating to decrement the associated coordinate. A dash in the third column means do not alter the variable
that keeps track of the coordinate changes.


After scanning the last (rightmost) bits of x and y, increment or decrement the appropriate coordinate as 
indicated by the final value of the variable. 


A C program implementing these steps is shown in Figure 14-11. Variable dx is initialized in such a way that
if invoked many times, the algorithm cycles around, generating the same Hilbert curve over and over again. 
(However, the step that connects one cycle to the next is not a unit step.) 

Figure 14-11 Program for taking one step on the Hilbert curve. 


void hil_inc_xy(unsigned *xp, unsigned *yp, int n) {

   int i;
   unsigned x, y, state, dx, dy, row, dochange;

   x = *xp;
   y = *yp;
   state = 0;                   // Initialize.
   dx = -((1 << n) - 1);        // Init. -(2**n - 1).
   dy = 0;

   for (i = n-1; i >= 0; i--) {          // Do n times.
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1;
      dochange = (0xBDDB >> row) & 1;
      if (dochange) {
         dx = ((0x16451659 >> 2*row) & 3) - 1;
         dy = ((0x51166516 >> 2*row) & 3) - 1;
      }
      state = (0x8FE65831 >> 2*row) & 3;
   }
   *xp = *xp + dx;
   *yp = *yp + dy;
}


Table 14-7. Taking one step on the Hilbert curve 
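
The rows of Table 14-7 can be read back out of the four constants used in Figure 14-11, which pack the table by row number (4 times the state, plus 2 times the x bit, plus the y bit). The sketch below unpacks one row. The function and parameter names are hypothetical, and states A through D are assumed to be numbered 0 through 3 as in the code.

```c
/* Unpack one row of the step table packed into the constants of
   Figure 14-11 (a sketch).
   change: 1 if this row sets the pending coordinate change, 0 for a
           dash in the table (dx and dy are then both 0).
   dx, dy: the pending change, each -1, 0, or +1. */
void hil_step_row(unsigned state, unsigned xi, unsigned yi,
                  int *change, int *dx, int *dy,
                  unsigned *next_state) {
   unsigned row = 4*state | 2*xi | yi;             /* 0 to 15. */
   *change = (int)((0xBDDBu >> row) & 1);
   *dx = (int)((0x16451659u >> 2*row) & 3) - 1;
   *dy = (int)((0x51166516u >> 2*row) & 3) - 1;
   *next_state = (0x8FE65831u >> 2*row) & 3;       /* 0..3 for A..D. */
}
```

Row 0 (state A, bits (0, 0)) decodes to "increment y, enter state B," which agrees with the description of the first row given in the text.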


Table 14-7 can readily be implemented in logic, as shown in Figure 14-12. In this figure, the variables have the 
following meanings: 


Figure 14-12. Logic circuit for incrementing (x, y) by one step along the Hilbert curve.

S and C together identify the "state" of Table 14-7, with (C, S) = (0, 0), (0, 1), (1, 0), and (1, 1) denoting states
A, B, C, and D, respectively. The output signals are I_0 and W_0, which tell, respectively, whether to increment or
decrement, and which variable to change. (In addition to the logic shown, an incrementer/decrementer circuit is
required, with MUX's to route either x or y to the incrementer/decrementer, and a circuit to route the altered
value back to the register that holds x or y. Alternatively, two incrementer/decrementer circuits could be used.)


14-5 Non-recursive Generating Algorithms 


The algorithms of Tables 14-2 and 14-7 provide two non-recursive algorithms for generating the Hilbert curve
of any order. Either algorithm can be implemented in hardware without great difficulty. Hardware based on
Table 14-2 includes a register holding s, which it increments for each step, and then converts to (x, y)
coordinates. Hardware based on Table 14-7 would not have to include a register for s, but the algorithm is more
complicated.


14-6 Other Space-Filling Curves 


As was mentioned, Peano was first, in 1890, to discover a space-filling curve. The many variations discovered 
since then are often called "Peano curves." One interesting variation of Hilbert's curve was discovered by 
Eliakim Hastings Moore in 1900. It is "cyclic" in the sense that the end point is one step away from the starting 
point. The Peano curve of order 3, and the Moore curve of order 4, are shown in Figure 14-13. Moore's curve
has an irregularity in that the order 1 curve is up-right-down, but this shape does not appear in the
higher-order curves. Except for this minor exception, the algorithms for dealing with Moore's curve are very
similar to those for the Hilbert curve.


Figure 14-13. Peano (left) and Moore (right) curves. 


The Hilbert curve has been generalized to arbitrary rectangles and to three and higher dimensions. The basic 
building block for a 3-dimensional Hilbert curve is shown below. It hits all eight points of a 2x2x2 cube. These 
and many other space-filling curves are discussed in [Sagan]. 


14-7 Applications 


Space-filling curves have applications in image processing: compression, halftoning, and textural analysis 
[L&S]. Another application is to improve computer performance in ray tracing, a graphics-rendering technique. 
Conventionally, a scene is scanned by projecting rays across the scene in ordinary raster scan line order (left to 
right across the screen, and then top to bottom). When a ray hits an object in the simulated scene's database, the 
color and other properties of the object at that point are determined and the results are used to illuminate the 
pixel through which the ray was sent. (This is an oversimplification, but it's adequate for our purposes.) One 
problem is that the database is often large and the data on each object must be paged in and cast out as various 
objects are hit by the scanning ray. When the ray scans across a line, it often hits many objects that were hit in 
the previous scan, requiring them to be paged in again. Paging operations would be reduced if the scanning had 
some kind of locality property. For example, it might be helpful to scan a quadrant of the screen completely 
before going on to another quadrant. 


The Hilbert curve seems to have the locality property we are seeking. It scans a quadrant completely before 
scanning another, recursively, and also does not make a long jump when going from one quadrant to another. 


Douglas Voorhies [Voor] has simulated what the paging behavior would likely be for the conventional
unidirectional scan line traversal, the Peano curve, and the Hilbert curve. His method is to scatter circles of a given
size randomly on the screen. A scan path hitting a circle represents touching a new object, and paging it in. But 
when a scan leaves a circle, it is presumed that the object's data remains in memory until the scan exits a circle 
of radius twice that of the "object" circle. Thus, if the scan leaves the object for just a short distance and then 
returns to it, it is assumed that no paging operation occurred. He repeats this experiment for many different 
sizes of circles, on a simulated 1024 x 1024 screen. 


Assume that entering an object circle and leaving its surrounding circle represent one paging operation. Then, 
clearly the normal scan line causes D paging operations in covering a (not too big) circle of diameter D pixels, 
because each scan line that enters it leaves its outer circle. The interesting result of Voorhies's simulation is that 
for the Peano curve, the number of paging operations to scan a circle is about 2.7 and, perhaps surprisingly, is 
independent of the circle's diameter. For the Hilbert curve the figure is about 1.4, also independent of the 
circle's diameter. Thus, the experiment suggests that the Hilbert curve is superior to the Peano curve, and vastly 
superior to the normal scan line path, in reducing paging operations. (The result that the page count is 
independent of the circles' diameters is probably an artifact of the outer circle's being proportional in size to the 
object circle.) 


Chapter 15. Floating-Point 


God created the integers, all else is the work of man. 
—Leopold Kronecker 


Operating on floating-point numbers with integer arithmetic and logical instructions is often a messy 
proposition. This is particularly true for the rules and formats of the IEEE Standard for Binary Floating-Point 
Arithmetic, IEEE Std. 754-1985, commonly known as "IEEE arithmetic." It has the NaN (not a number) and 
infinities, which are special cases for almost all operations. It has plus and minus zero, which must compare 
equal to one another. It has a fourth comparison result, "unordered." The most significant bit of the fraction is 
not explicitly present in "normal" numbers, but it is in "denormalized" or "subnormal" numbers. The fraction is 
in signed-true form and the exponent is in biased form, whereas integers are now almost universally in
two's-complement form. There are of course reasons for all this, but it results in programs that are full of compares
and branches, and that present a challenge to implement efficiently. 


We assume the reader has some familiarity with the IEEE standard, and thus summarize it here only very 
briefly. 


15-1 IEEE Format 


We will restrict our attention to the single and double formats (32- and 64-bit) described in IEEE 754. The 
standard also describes "single extended" and "double extended" formats, but they are only loosely described 
because the details are implementation-dependent (e.g., the exponent width is unspecified in the standard). The 
single and double formats are shown below. 


Single format               Double format
 s    e     f                s    e     f
 1    8    23                1   11    52      (field widths in bits)


The sign bit s is encoded as 0 for plus, 1 for minus. The biased exponent e and fraction f are magnitudes with
their most significant bits on the left. The floating-point value represented is encoded as shown below.


Single format                                Double format
e           f       Value                    e            f       Value
0           0       ±0                       0            0       ±0
0           ≠ 0     ±2^(-126) × 0.f          0            ≠ 0     ±2^(-1022) × 0.f
1 to 254    any     ±2^(e-127) × 1.f         1 to 2046    any     ±2^(e-1023) × 1.f
255         0       ±∞                       2047         0       ±∞
255         ≠ 0     NaN                      2047         ≠ 0     NaN


As an example, consider encoding the number π in single format. In binary [Knu1],


π ≈ 11.0010 0100 0011 1111 0110 1010 1000 1000 1000 0101 1010 0011 0000 10...


This is in the range of the "normalized" numbers shown in the third row of the table above. The most
significant 1 in π is dropped, as the leading 1 is not stored in the encoding of normalized numbers. The
exponent e - 127 should be 1, to get the binary point in the right place, and hence e = 128. Thus, the
representation is


0 10000000 10010010000111111011011


or, in hexadecimal, 


40490FDB, 


where we have rounded the fraction to the nearest representable number. 
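
The encoding can be inspected directly on a machine whose float type is the IEEE single format. The sketch below makes that assumption explicit (it holds on mainstream platforms but is not guaranteed by the C standard), and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern. Assumes float is the
   IEEE single format. */
uint32_t float_bits(float f) {
   uint32_t u;
   memcpy(&u, &f, sizeof u);       /* Well-defined type punning. */
   return u;
}

/* The three fields of the single format. */
uint32_t float_sign(float f)       { return float_bits(f) >> 31; }
uint32_t float_biased_exp(float f) { return (float_bits(f) >> 23) & 0xFF; }
uint32_t float_fraction(float f)   { return float_bits(f) & 0x007FFFFF; }
```

Applied to the constant 3.14159265358979323846f, these give sign 0, biased exponent 128, and fraction 0x490FDB, which together form 40490FDB.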


Numbers with 1 ≤ e ≤ 254 are called "normalized numbers." These are in "normal" form, meaning that their
most significant bit is not explicitly stored. Nonzero numbers with e = 0 are called "denormalized numbers," or
simply "denorms." Their most significant bit is explicitly stored. This scheme is sometimes called "gradual
underflow." Some extreme values in the various ranges of floating-point numbers are shown in Table 15-1. In
this table "Max integer" means the largest integer such that all integers less than or equal to it, in absolute
value, are representable exactly; the next integer is rounded.


For normalized numbers, one unit in the last position (ulp) has a relative value ranging from $1/2^{24}$ to $1/2^{23}$
(about $5.96 \times 10^{-8}$ to $1.19 \times 10^{-7}$) for single format, and from $1/2^{53}$ to $1/2^{52}$ (about
$1.11 \times 10^{-16}$ to $2.22 \times 10^{-16}$) for double format. The maximum "relative error," for round to nearest
mode, is half of those figures.


The range of integers that is represented exactly is from $-2^{24}$ to $+2^{24}$ (-16,777,216 to +16,777,216) for single
format, and from $-2^{53}$ to $+2^{53}$ (-9,007,199,254,740,992 to +9,007,199,254,740,992) for double format. Of
course, certain integers outside these ranges, such as larger powers of 2, can be represented exactly; the ranges
cited are the maximal ranges for which all integers are represented exactly.
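
The single-format limit is easy to observe by converting an integer to float and back. The following sketch assumes that float is the IEEE single format with round to nearest, and that the argument is no larger than 2^30 in magnitude so that the conversion back to an integer cannot overflow. The function name is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if the integer v survives a round trip through the
   single format unchanged, else 0. Assumes |v| <= 2^30. */
int int_is_exact_in_float(int32_t v) {
   volatile float f = (float)v;    /* volatile forces a true single. */
   return (int32_t)f == v;
}
```

With this, 16,777,216 is reported as exact, 16,777,217 is not (it rounds to 16,777,216), and 16,777,218 is exact again because it is even.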


Table 15-1. Extreme Values (single precision and double precision)


One might want to change division by a constant to multiplication by the reciprocal. This can be done with
complete (IEEE) accuracy only for numbers whose reciprocals are represented exactly. These are the powers of
2 from $2^{-127}$ to $2^{127}$ for single format, and from $2^{-1023}$ to $2^{1023}$ for double format. The numbers
$2^{-127}$ and $2^{-1023}$ are denormalized numbers, which are best avoided on machines that implement operations
on denormalized numbers inefficiently.


15-2 Comparing Floating-Point Numbers Using Integer Operations 


One of the features of the IEEE encodings is that non-NaN values are properly ordered if treated as
signed-magnitude integers.


To program a floating-point comparison using integer operations, it is necessary that the "unordered" result not 
be needed. In IEEE 754, the unordered result occurs when one or both comparands are NaNs. The methods 
below treat NaNs as if they were numbers greater in magnitude than infinity. 


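The comparisons are also much simpler if -0.0 may be treated as strictly less than +0.0 (which is not in
accordance with IEEE 754). Assuming this is acceptable, the comparisons may be done as shown below, where
$\overset{f}{=}$, $\overset{f}{<}$, and $\overset{f}{\le}$ denote floating-point comparisons, $\overset{u}{>}$ and
$\overset{u}{\ge}$ denote unsigned integer comparisons, and the $\approx$ symbol is used as a reminder that these
formulas do not treat ±0.0 quite right.


$$
\begin{aligned}
a \overset{f}{=} b &\approx (a = b) \\
a \overset{f}{<} b &\approx (a \ge 0 \;\&\; a < b) \mid (a < 0 \;\&\; a \overset{u}{>} b) \\
a \overset{f}{\le} b &\approx (a \ge 0 \;\&\; a \le b) \mid (a < 0 \;\&\; a \overset{u}{\ge} b)
\end{aligned}
$$


If -0.0 must be treated as equal to +0.0, there does not seem to be any very slick way to do it, but the following
formulas, which follow more or less obviously from the above, are possibilities.


$$
\begin{aligned}
a \overset{f}{=} b &\equiv (a = b) \mid (-a = a \;\&\; -b = b) \\
                   &\equiv (a = b) \mid ((a \mid b) = \text{0x80000000}) \\
                   &\equiv (a = b) \mid (((a \mid b) \;\&\; \text{0x7FFFFFFF}) = 0) \\
a \overset{f}{<} b &\equiv ((a \ge 0 \;\&\; a < b) \mid (a < 0 \;\&\; a \overset{u}{>} b)) \;\&\; ((a \mid b) \ne \text{0x80000000}) \\
a \overset{f}{\le} b &\equiv (a \ge 0 \;\&\; a \le b) \mid (a < 0 \;\&\; a \overset{u}{\ge} b) \mid ((a \mid b) = \text{0x80000000})
\end{aligned}
$$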
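
These formulas translate directly into C operating on the bit patterns. The sketch below is one possible rendering, not code from the text: the function names are hypothetical, float is assumed to be the IEEE single format, and NaNs are not given any special treatment.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a float as a signed 32-bit integer. */
int32_t fbits(float f) {
   int32_t i;
   memcpy(&i, &f, sizeof i);
   return i;
}

/* a < b, with -0.0 treated as strictly less than +0.0. */
int flt_lt_approx(int32_t a, int32_t b) {
   return (a >= 0 && a < b) ||
          (a < 0 && (uint32_t)a > (uint32_t)b);
}

/* a == b, with -0.0 treated as equal to +0.0. */
int flt_eq(int32_t a, int32_t b) {
   return a == b || ((uint32_t)(a | b) & 0x7FFFFFFFu) == 0;
}

/* a < b, with -0.0 treated as equal to +0.0. */
int flt_lt(int32_t a, int32_t b) {
   return flt_lt_approx(a, b) && (uint32_t)(a | b) != 0x80000000u;
}

/* a <= b, with -0.0 treated as equal to +0.0. */
int flt_le(int32_t a, int32_t b) {
   return (a >= 0 && a <= b) ||
          (a < 0 && (uint32_t)a >= (uint32_t)b) ||
          (uint32_t)(a | b) == 0x80000000u;
}
```

The difference between the two "less than" versions shows up only when one comparand is -0.0 and the other is +0.0.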


In some applications, it might be more efficient to first transform the numbers in some way, and then do a
floating-point comparison with a single fixed-point comparison instruction. For example, in sorting n numbers,
the transformation would be done only once to each number, whereas a comparison must be done at least
$\lceil n \log_2 n \rceil$ times (in the minimax sense).


Table 15-2 gives four such transformations. For those in the left column, -0.0 compares equal to +0.0, and for
those in the right column, -0.0 compares less than +0.0. In all cases, the sense of the comparison is not altered
by the transformation. Variable n is signed, t is unsigned, and c may be either signed or unsigned.


The last row shows branch-free code that can be implemented on our basic RISC in four instructions for the left 
column, and three for the right column (these four or three instructions must be executed for each comparand). 
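
The branch-free forms in the last row of Table 15-2 can be written in C as below. This is a sketch with hypothetical function names. It assumes that float is the IEEE single format, that a right shift of a negative int32_t is arithmetic, and that converting an out-of-range uint32_t to int32_t wraps around, all of which hold for the usual compilers but are implementation-defined in C.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a float as a signed 32-bit integer. */
int32_t float_as_int(float f) {
   int32_t i;
   memcpy(&i, &f, sizeof i);
   return i;
}

/* Last row, left column: -0.0 and +0.0 receive the same key.
   Compare the keys as signed integers. */
int32_t key_ieee(int32_t n) {
   uint32_t t = (uint32_t)(n >> 31);        /* All 1's if n < 0, else 0. */
   return (int32_t)(((uint32_t)n ^ (t >> 1)) - t);
}

/* Last row, right column: -0.0 receives a key below that of +0.0.
   Compare the keys as signed integers. */
int32_t key_nonieee(int32_t n) {
   uint32_t t = (uint32_t)(n >> 30) >> 1;   /* 0x7FFFFFFF if n < 0, else 0. */
   return (int32_t)((uint32_t)n ^ t);
}
```

In a sort, each key would be computed once per number, after which the keys are ordered with ordinary signed integer comparisons.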


15-3 The Distribution of Leading Digits 


When IBM introduced the System/360 computer in 1964, numerical analysts were horrified at the loss of 
precision of single-precision arithmetic. The previous IBM computer line, the 704 - 709 - 7090 family, had a
36-bit word. For single-precision floating-point, the format consisted of a 9-bit sign and exponent field, followed
by a 27-bit fraction in binary. The most significant fraction bit was explicitly included (in "normal" numbers), 
so quantities were represented with a precision of 27 bits. 


Table 15-2. Preconditioning Floating-Point Numbers for Integer Comparisons


-0.0 = +0.0 (IEEE)                     -0.0 < +0.0 (non-IEEE)

if (n >= 0) n = n+0x80000000;          if (n >= 0) n = n+0x80000000;
else n = -n;                           else n = ~n;
Use unsigned comparison.               Use unsigned comparison.

c = 0x7FFFFFFF;                        c = 0x7FFFFFFF;
if (n < 0) n = (n ^ c) + 1;            if (n < 0) n = n ^ c;
Use signed comparison.                 Use signed comparison.

c = 0x80000000;                        c = 0x7FFFFFFF;
if (n < 0) n = c - n;                  if (n < 0) n = c - n;
Use signed comparison.                 Use signed comparison.

t = n >> 31;                           t = (unsigned)(n>>30) >> 1;
n = (n ^ (t >> 1)) - t;                n = n ^ t;
Use signed comparison.                 Use signed comparison.


The S/360 has a 32-bit word. For single-precision, IBM chose to have an 8-bit sign and exponent field followed
by a 24-bit fraction. This drop from 27 to 24 bits was bad enough, but it gets worse. To keep the exponent 
range large, a unit in the 7-bit exponent of the S/360 format represents a factor of 16. Thus, the fraction is in 
base 16, and this format came to be called "hexadecimal" floating-point. The leading digit can be any number 
from 1 to 15 (binary 0001 to 1111). Numbers with leading digit 1 have only 21 bits of precision (because of the 
three leading 0's), but they should constitute only 1/15 (6.7%) of all numbers. 


No, it's worse than that! There was a flurry of activity to show, both analytically and empirically, that leading 
digits are not uniformly distributed. In hexadecimal floating-point, one would expect 25% of the numbers to 
have leading digit 1, and hence only 21 bits of precision. 


Let us consider the distribution of leading digits in decimal. Suppose you have a large set of numbers with
units, such as length, volume, mass, speed, and so on, expressed in "scientific" notation (e.g., $6.022 \times 10^{23}$). If
the leading digit of a large number of such numbers has a well-defined distribution function, then it must be
independent of the units—whether inches or centimeters, pounds or kilograms, and so on. Thus, if you multiply 
all the numbers in the set by any constant, the distribution of leading digits should be unchanged. For example, 
considering multiplying by 2, we conclude that the number of numbers with leading digit 1 (those from 1.0 to 
1.999... times 10 to some power) must equal the number of numbers with leading digit 2 or 3 (those from 2.0 
to 3.999... times 10 to some power), because it shouldn't matter if our unit of length is inches or half inches, or 
our unit of mass is kilograms or half kilograms, and so on. 


Let f(x), for 1 ≤ x < 10, be the probability density function for the leading digits of the set of numbers with
units. f(x) has the property that


$$\int_a^b f(x)\,dx$$


is the proportion of numbers that have leading digits ranging from a to b. Referring to the figure below, for a
small increment Δx in x, f must satisfy


$$f(1) \cdot \Delta x = f(x) \cdot x\,\Delta x,$$


because f(1) · Δx is, approximately, the proportion of numbers ranging from 1 to 1 + Δx (ignoring a multiplier
of a power of 10), and f(x) · xΔx is the approximate proportion of numbers ranging from x to x + xΔx. Because
the latter set is the first set multiplied by x, their proportions must be equal. Thus, the probability density
function is a simple inverse relationship,


$$f(x) = f(1)/x.$$


Because the area under the curve from x = 1 to x = 10 must be 1 (all numbers have leading digits from 1.000...
to 9.999...), it is easily shown that


$$f(1) = 1/\ln 10.$$
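
The step left to the reader can be written out in one line, using only the inverse relationship f(x) = f(1)/x and the requirement that the total area be 1:

```latex
\int_1^{10} f(x)\,dx \;=\; f(1)\int_1^{10}\frac{dx}{x} \;=\; f(1)\,\ln 10 \;=\; 1
\quad\Longrightarrow\quad
f(1) \;=\; \frac{1}{\ln 10} \;\approx\; 0.4343 .
```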


The proportion of numbers with leading digits in the range a to b, with 1 ≤ a ≤ b < 10, is


$$\int_a^b \frac{dx}{x \ln 10} = \left.\frac{\ln x}{\ln 10}\right|_a^b = \frac{\ln (b/a)}{\ln 10} = \log_{10}\frac{b}{a}.$$


Thus, in decimal, the proportion of numbers with leading digit 1 is $\log_{10}(2/1) \approx 0.30103$, and the proportion of
numbers with leading digit 9 is $\log_{10}(10/9) \approx 0.0458$.


For base 16, the proportion of numbers with leading digits in the range a to b, with 1 ≤ a ≤ b < 16, is similarly
derived to be $\log_{16}(b/a)$. Hence the proportion of numbers with leading digit 1 is $\log_{16}(2/1) = 1/\log_2 16 = 0.25$.
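
These proportions can be observed on a concrete sequence. The sketch below uses the powers of 2, a standard example of a sequence whose leading digits follow the logarithmic law; that choice of data is an assumption of this illustration and is not taken from the text, and the function names are hypothetical.

```c
/* Counts how many of 2^0, 2^1, ..., 2^(count-1) have leading decimal
   digit 1. The power is tracked as a double scaled into [1, 10), so
   only its leading digits matter. */
int count_leading_one_decimal(int count) {
   double x = 1.0;                  /* 2^k divided by a power of 10. */
   int k, ones = 0;
   for (k = 0; k < count; k++) {
      if (x < 2.0) ones++;          /* Leading digit is 1. */
      x *= 2.0;
      if (x >= 10.0) x /= 10.0;     /* Rescale into [1, 10). */
   }
   return ones;
}

/* Counts how many of 2^0, ..., 2^(count-1) have leading hexadecimal
   digit 1. The leading hex digit of 2^k is 2^(k mod 4). */
int count_leading_one_hex(int count) {
   int k, ones = 0;
   for (k = 0; k < count; k++)
      if ((1u << (k % 4)) == 1u) ones++;
   return ones;
}
```

For the first 1000 powers of 2 the counts are 301 in decimal and 250 in hexadecimal, close to the predicted proportions 0.30103 and 0.25.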


15-4 Table of Miscellaneous Values 


Table 15-3 shows the IEEE representation of miscellaneous values that may be of interest. The values that are 
not exact are rounded to the nearest representable value. 


Table 15-3. Miscellaneous Values


Decimal                      Single Format (Hex)   Double Format (Hex)
-∞                           FF80 0000             FFF0 0000 0000 0000
-2.0                         C000 0000             C000 0000 0000 0000
-1.0                         BF80 0000             BFF0 0000 0000 0000
-0.5                         BF00 0000             BFE0 0000 0000 0000
-0.0                         8000 0000             8000 0000 0000 0000
+0.0                         0000 0000             0000 0000 0000 0000
Smallest positive denorm     0000 0001             0000 0000 0000 0001
Largest denorm               007F FFFF             000F FFFF FFFF FFFF
Least positive normalized    0080 0000             0010 0000 0000 0000
π/180 (0.01745...)           3C8E FA35             3F91 DF46 A252 9D39
0.1                          3DCC CCCD             3FB9 9999 9999 999A
log10 2 (0.3010...)          3E9A 209B             3FD3 4413 509F 79FF
1/e (0.3678...)              3EBC 5AB2             3FD7 8B56 362C EF38
1/ln 10 (0.4342...)          3EDE 5BD9             3FDB CB7B 1526 E50E
0.5                          3F00 0000             3FE0 0000 0000 0000
ln 2 (0.6931...)             3F31 7218             3FE6 2E42 FEFA 39EF
1/√2 (0.7071...)             3F35 04F3             3FE6 A09E 667F 3BCD
1/ln 3 (0.9102...)           3F69 0570             3FED 20AE 03BC C153
1.0                          3F80 0000             3FF0 0000 0000 0000
ln 3 (1.0986...)             3F8C 9F54             3FF1 93EA 7AAD 030B
√2 (1.414...)                3FB5 04F3             3FF6 A09E 667F 3BCD
1/ln 2 (1.442...)            3FB8 AA3B             3FF7 1547 652B 82FE
√3 (1.732...)                3FDD B3D7             3FFB B67A E858 4CAA
2.0                          4000 0000             4000 0000 0000 0000
ln 10 (2.302...)             4013 5D8E             4002 6BB1 BBB5 5516
e (2.718...)                 402D F854             4005 BF0A 8B14 5769
3.0                          4040 0000             4008 0000 0000 0000
π (3.141...)                 4049 0FDB             4009 21FB 5444 2D18
√10 (3.162...)               404A 62C2             4009 4C58 3ADA 5B53
log2 10 (3.321...)           4054 9A78             400A 934F 0979 A371
4.0                          4080 0000             4010 0000 0000 0000
5.0                          40A0 0000             4014 0000 0000 0000
7.0                          40E0 0000             401C 0000 0000 0000
8.0                          4100 0000             4020 0000 0000 0000
9.0                          4110 0000             4022 0000 0000 0000
10.0                         4120 0000             4024 0000 0000 0000
11.0                         4130 0000             4026 0000 0000 0000
12.0                         4140 0000             4028 0000 0000 0000
13.0                         4150 0000             402A 0000 0000 0000
14.0                         4160 0000             402C 0000 0000 0000
15.0                         4170 0000             402E 0000 0000 0000
16.0                         4180 0000             4030 0000 0000 0000
180/π (57.295...)            4265 2EE1             404C A5DC 1A63 C1F8
2^23 - 1                     4AFF FFFE             415F FFFF C000 0000
2^23                         4B00 0000             4160 0000 0000 0000
2^24 - 1                     4B7F FFFF             416F FFFF E000 0000
2^24                         4B80 0000             4170 0000 0000 0000
2^31 - 1                     4F00 0000             41DF FFFF FFC0 0000
2^31                         4F00 0000             41E0 0000 0000 0000
2^32 - 1                     4F80 0000             41EF FFFF FFE0 0000
2^32                         4F80 0000             41F0 0000 0000 0000
2^52                         5980 0000             4330 0000 0000 0000
2^63                         5F00 0000             43E0 0000 0000 0000
2^64                         5F80 0000             43F0 0000 0000 0000
Largest normalized           7F7F FFFF             7FEF FFFF FFFF FFFF
∞                            7F80 0000             7FF0 0000 0000 0000
"Smallest" SNaN              7F80 0001             7FF0 0000 0000 0001
"Largest" SNaN               7FBF FFFF             7FF7 FFFF FFFF FFFF
"Smallest" QNaN              7FC0 0000             7FF8 0000 0000 0000
"Largest" QNaN               7FFF FFFF             7FFF FFFF FFFF FFFF


IEEE 754 does not specify how the signaling and quiet NaNs are distinguished. Table 15-3 uses the convention 
employed by PowerPC, the AMD 29050, the Intel x86 and i860, and the Fairchild Clipper: The most
significant fraction bit is 0 for signaling and 1 for quiet NaN's. The Compaq Alpha, HP PA-RISC, and MIPS 
computers use the same bit to make the distinction, but in the opposite sense (0 = quiet, 1 = signaling). 
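
Under the first convention (most significant fraction bit 0 for signaling, 1 for quiet), the two kinds of NaN can be told apart with integer operations on the single-format bit pattern. The sketch below assumes that convention; the function names are hypothetical, and on a machine using the opposite sense the two classification results would be exchanged.

```c
#include <stdint.h>

/* True if the single-format pattern u encodes a NaN: exponent field
   all 1's and a nonzero fraction. */
int is_nan_bits(uint32_t u) {
   return (u & 0x7FFFFFFFu) > 0x7F800000u;
}

/* Quiet NaN: most significant fraction bit (bit 22) is 1. */
int is_quiet_nan_bits(uint32_t u) {
   return is_nan_bits(u) && (u & 0x00400000u) != 0;
}

/* Signaling NaN: most significant fraction bit (bit 22) is 0. */
int is_signaling_nan_bits(uint32_t u) {
   return is_nan_bits(u) && (u & 0x00400000u) == 0;
}
```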


Chapter 16. Formulas for Primes 


Introduction 
Willans's Formulas 
Wormell's Formula 


Formulas for Other Difficult Functions 


16-1 Introduction 


Like many young students, I once became fascinated with prime numbers, and tried to find a formula for them. 
I didn't know exactly what operations would be considered valid in a "formula," or exactly what function I was 
looking for—a formula for the nth prime in terms of n, or in terms of the previous prime(s), or a formula that 
produces primes but not all of them, and so on. Nevertheless, in spite of these ambiguities, I would like to 
discuss a little of what is known about this problem. We will see that (a) there are formulas for primes, and (b) 
none of them are very satisfying. 


Much of this subject relates to the present work in that it deals with formulas similar to those of some of our 
programming tricks, albeit in the domain of real number arithmetic rather than "computer arithmetic." But let 
us first review a few highlights from the history of this subject. 


In 1640, Fermat conjectured that the formula


$$F_n = 2^{2^n} + 1$$


always produces a prime, and numbers of this form have come to be called "Fermat numbers." It is true that $F_n$
is prime for n ranging from 0 to 4, but Euler found in 1732 that


$$F_5 = 2^{2^5} + 1 = 641 \cdot 6700417.$$


(We have seen these factors before in connection with dividing by a constant on a 32-bit machine.) Then, F.
Landry showed in 1880 that


$$F_6 = 2^{2^6} + 1 = 274177 \cdot 67280421310721.$$


$F_n$ is now known to be composite for many larger values of n, such as all n from 7 to 16 inclusive. For no value
of n > 4 is it known to be prime [H&W]. So much for rash conjectures.[1]


[1] However, this is the only conjecture of Fermat known to be wrong [Wells].
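
Euler's factorization, and the primality of the first five Fermat numbers, are small enough to check with 64-bit integer arithmetic. The sketch below uses simple trial division; the function names are hypothetical, and fermat_number is meant only for 0 <= n <= 5, where the result fits in 64 bits.

```c
#include <stdint.h>

/* F_n = 2^(2^n) + 1, for 0 <= n <= 5. */
uint64_t fermat_number(int n) {
   return ((uint64_t)1 << (1 << n)) + 1;
}

/* Primality by trial division; adequate for the small values here. */
int is_prime_u64(uint64_t v) {
   uint64_t d;
   if (v < 2) return 0;
   for (d = 2; d * d <= v; d++)
      if (v % d == 0) return 0;     /* Found a proper divisor. */
   return 1;
}
```

Trial division reports 3, 5, 17, 257, and 65537 as prime, and stops on F_5 = 4294967297 as soon as it reaches the divisor 641.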


Incidentally, why would Fermat be led to the double exponential? He knew that if m has an odd factor other
than 1, then $2^m + 1$ is composite. For if m = ab with b odd and not equal to 1, then


$$2^{ab} + 1 = (2^a + 1)(2^{a(b-1)} - 2^{a(b-2)} + 2^{a(b-3)} - \cdots + 1).$$


Knowing this, he must have wondered about $2^m + 1$ with m not containing any odd factors (other than 1), that
is, $m = 2^n$. He tried a few values of n and found that $2^{2^n} + 1$ seemed to be prime.


Certainly everyone would agree that a polynomial qualifies as a "formula." One rather amazing polynomial was 
discovered by Leonhard Euler in 1772. He found that 


$$f(n) = n^2 + n + 41$$


is prime-valued for every n from 0 to 39. His result can be extended. Because


$$f(-n) = n^2 - n + 41 = f(n - 1),$$


f(-n) is prime-valued for every n from 1 to 40; that is, f(n) is prime-valued for every n from -1 to -40. Therefore,


$$f(n - 40) = (n - 40)^2 + (n - 40) + 41 = n^2 - 79n + 1601$$


is prime-valued for every n from 0 to 79. (However, it is lacking in aesthetic appeal because it is nonmonotonic
and it repeats; that is, for n = 0, 1, ..., 79, $n^2 - 79n + 1601$ = 1601, 1523, 1447, ..., 43, 41, 41, 43, ..., 1447,
1523, 1601.)
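

Both claims can be checked mechanically. This is a small Python sketch under our own naming (`euler`, `shifted`, and a trial-division `is_prime` are illustrative helpers, not part of the original text):

```python
def is_prime(k):
    """Trial division; fine for values below a few thousand."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def euler(n):
    """Euler's polynomial n^2 + n + 41."""
    return n * n + n + 41


def shifted(n):
    """The shifted polynomial f(n - 40) = n^2 - 79n + 1601."""
    return n * n - 79 * n + 1601


euler_run = all(is_prime(euler(n)) for n in range(40))       # n = 0 .. 39
shifted_run = all(is_prime(shifted(n)) for n in range(80))   # n = 0 .. 79
first_failure = euler(40)                                    # 41 * 41
```

The run stops at n = 40 because $f(40) = 40 \cdot 41 + 41 = 41^2$.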


In spite of this success, it is now known that there is no polynomial f(n) that produces a prime for every n (aside 
from constant polynomials such as f(n) = 5). In fact, any nontrivial "polynomial in exponentials" is composite 
infinitely often. More precisely, as stated in [H&W], 


Theorem. If $f(n) = P(n, 2^n, 3^n, \ldots, k^n)$ is a polynomial in its arguments, with integral coefficients, and
$f(n) \to \infty$ when $n \to \infty$, then f(n) is composite for an infinity of values of n.


Thus, a formula such as $n^2 \cdot 2^n + 2n^3 + 2n + 5$ must produce an infinite number of composites. On the other
hand, the theorem says nothing about formulas containing terms such as $2^{2^n}$, $n^n$, and n!.


A formula for the nth prime in terms of n can be obtained by using the floor function and a magic number 


$$a = 0.203005000700011000013\ldots$$


The number a is, in decimal, the first prime written in the first place after the decimal point, the second prime
written in the next two places, the third prime written in the next three places, and so on. There is always room
for the nth prime, because $p_n < 10^n$. We will not prove this, except to point out that it is known that there is
always a prime between n and 2n (for n ≥ 2), and hence certainly at least one between n and 10n, from which it
follows that $p_n < 10^n$. The formula for the nth prime is


$$p_n = \left\lfloor 10^{\frac{n^2+n}{2}} a \right\rfloor - 10^n \left\lfloor 10^{\frac{n^2-n}{2}} a \right\rfloor,$$


where we have used the relation $1 + 2 + 3 + \ldots + n = (n^2 + n)/2$. For example,


$$p_3 = \lfloor 10^6 a \rfloor - 10^3 \lfloor 10^3 a \rfloor = 203005 - 203000 = 5.$$


This is a pretty cheap trick, as it requires knowledge of the result to define a. The formula would be interesting 
if there were some way to define a independently of the primes, but no one knows of such a definition. 


Obviously, this technique can be used to obtain a formula for many sequences, but it begs the question. 
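

To make the circularity concrete, here is a sketch in Python that builds a truncation of a from a list of primes supplied in advance and then "recovers" them with the formula. The names `PRIMES`, `build_scaled_a`, and `nth_prime_from_a` are ours. To stay exact, the code keeps the integer $A = \lfloor 10^T a \rfloor$ (with T = 1 + 2 + ... + N digits) instead of a floating-point a.

```python
# The primes must be known beforehand; that is exactly the "cheap trick".
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]


def build_scaled_a(primes):
    """Return (A, T): A = floor(10^T * a), T = number of digits used."""
    # The i-th prime is written in a field of i digits, zero-padded.
    digits = "".join(str(p).zfill(i) for i, p in enumerate(primes, start=1))
    return int(digits), len(digits)


def nth_prime_from_a(n, A, T):
    """Evaluate floor(10^((n^2+n)/2) a) - 10^n floor(10^((n^2-n)/2) a)."""
    hi = (n * n + n) // 2
    lo = (n * n - n) // 2
    floor_hi = A // 10 ** (T - hi)   # first `hi` digits of a
    floor_lo = A // 10 ** (T - lo)   # first `lo` digits of a
    return floor_hi - 10 ** n * floor_lo


A, T = build_scaled_a(PRIMES)
recovered = [nth_prime_from_a(n, A, T) for n in range(1, len(PRIMES) + 1)]
```

The formula only reads back the digits that were written into a, which is why it begs the question.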


16-2 Willans's Formulas 


C. P. Willans gives the following formula for the nth prime [Will]: 


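$$p_n = 1 + \sum_{m=1}^{2^n} \left\lfloor \left( \frac{n}{\sum_{x=1}^{m} \left\lfloor \cos^2 \pi \frac{(x-1)! + 1}{x} \right\rfloor} \right)^{1/n} \right\rfloor.$$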


The derivation starts from Wilson's theorem, which states that p is prime or 1 if and only if
$(p - 1)! \equiv -1 \pmod{p}$. Thus,


$$\frac{(x-1)! + 1}{x}$$


is an integer for x prime or x = 1 and is fractional for all composite x. Hence
Equation 1


$$F(x) = \left\lfloor \cos^2 \pi \frac{(x-1)! + 1}{x} \right\rfloor = \begin{cases} 1, & x \text{ prime or } 1, \\ 0, & x \text{ composite.} \end{cases}$$


Thus, if $\pi(m)$[2] denotes the number of primes ≤ m,
Equation 2


$$\pi(m) = -1 + \sum_{x=1}^{m} F(x).$$


[2] Our apologies for the two uses of π in close proximity, but it's standard notation and shouldn't cause any difficulty.


Observe that $\pi(p_n) = n$, and furthermore,


$\pi(m) < n$, for $m < p_n$, and


$\pi(m) \ge n$, for $m \ge p_n$.


Therefore, the number of values of m from 1 to ∞ for which $\pi(m) < n$ is $p_n - 1$. That is,
Equation 3


$$p_n = 1 + \sum_{m=1}^{\infty} (\pi(m) < n),$$


where the summand is a "predicate expression" (0/1-valued). 
Because we have a formula for n(m), Equation (3) constitutes a formula for the nth prime as a function of n. 


But it has two features that might be considered unacceptable: an infinite summation and the use of a "predicate 
expression," which is not in standard mathematical usage. 


It has been proved that for n ≥ 1 there is at least one prime between n and 2n. Therefore, the number of primes
≤ $2^n$ is at least n; that is, $\pi(2^n) \ge n$. Thus, the predicate $\pi(m) < n$ is 0 for $m \ge 2^n$, so the upper limit of the
summation above can be replaced with $2^n$.


Willans has a rather clever substitute for the predicate expression. Let 


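$$\mathrm{LT}(x, y) = \left\lfloor \left( \frac{y}{1 + x} \right)^{1/y} \right\rfloor, \quad \text{for } x = 0, 1, 2, \ldots;\ y = 1, 2, \ldots.$$


Then, if x < y, $1 \le y/(1 + x) \le y$, so $1 \le (y/(1 + x))^{1/y} \le y^{1/y} < 2$. Furthermore, if x ≥ y, then
$0 < y/(1 + x) < 1$, so $0 \le (y/(1 + x))^{1/y} < 1$. Applying the floor function, we have


$$\mathrm{LT}(x, y) = \begin{cases} 1, & \text{for } x < y, \\ 0, & \text{for } x \ge y. \end{cases}$$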


That is, LT(x, y) is the predicate x < y (for x and y in the given ranges). 


Substituting, Equation (3) can be written 


$$p_n = 1 + \sum_{m=1}^{2^n} \mathrm{LT}(\pi(m), n) = 1 + \sum_{m=1}^{2^n} \left\lfloor \left( \frac{n}{1 + \pi(m)} \right)^{1/n} \right\rfloor.$$

Further substituting Equation (2) for n(m) in terms of F(x), and Equation (1) for F(x), gives the formula shown 
at the beginning of this section. 
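

The derivation can be followed step by step in code. The sketch below is in Python and uses exact integer arithmetic throughout; the function names are ours. Two substitutions are assumptions made for exactness and are not Willans's own notation: $\lfloor \cos^2 \pi q \rfloor$ is evaluated as "q is an integer" (a divisibility test), and $\lfloor (y/(1+x))^{1/y} \rfloor$ is found as the largest integer k with $k^y (1 + x) \le y$.

```python
from math import factorial


def F(x):
    """Equation (1): 1 if x is prime or 1, else 0.

    floor(cos^2(pi * q)) is 1 exactly when q is an integer, so we test
    whether x divides (x-1)! + 1 (Wilson's theorem).
    """
    return 1 if (factorial(x - 1) + 1) % x == 0 else 0


def pi_count(m):
    """Equation (2): number of primes <= m."""
    return -1 + sum(F(x) for x in range(1, m + 1))


def LT(x, y):
    """floor((y / (1 + x)) ** (1 / y)), computed without floating point."""
    k = 0
    while (k + 1) ** y * (1 + x) <= y:
        k += 1
    return k


def willans_prime(n):
    """Equation (3) with the upper limit 2^n."""
    return 1 + sum(LT(pi_count(m), n) for m in range(1, 2 ** n + 1))


first_six = [willans_prime(n) for n in range(1, 7)]
```

The cost is dominated by the factorials and by the $2^n$ terms, so this is only practical for very small n, which is consistent with the remark that none of these formulas are very satisfying.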


Second Formula 


Willans then gives another formula: 


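$$p_n = \sum_{m=1}^{2^n} m F(m) \left\lfloor 2^{-|\pi(m) - n|} \right\rfloor.$$


Here, F and π are the functions used in his first formula. Thus, mF(m) = m if m is prime or 1, and 0 otherwise.
The third factor in the summand is the predicate $\pi(m) = n$. The summand is 0 except for one term, which is the
nth prime. For example,


$$\begin{aligned} p_4 &= 1 \cdot 1 \cdot 0 + 2 \cdot 1 \cdot 0 + 3 \cdot 1 \cdot 0 + 4 \cdot 0 \cdot 0 + 5 \cdot 1 \cdot 0 + 6 \cdot 0 \cdot 0 + 7 \cdot 1 \cdot 1 \\ &\quad + 8 \cdot 0 \cdot 1 + 9 \cdot 0 \cdot 1 + 10 \cdot 0 \cdot 1 + 11 \cdot 1 \cdot 0 + \cdots + 16 \cdot 0 \cdot 0 \\ &= 7. \end{aligned}$$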
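

A direct transcription of the second formula, again as a hedged Python sketch with our own function names. The factor $\lfloor 2^{-k} \rfloor$ for an integer k ≥ 0 is computed exactly as the integer quotient `1 // 2**k`, and F is evaluated through the divisibility test implied by Wilson's theorem.

```python
from math import factorial


def F(x):
    """1 if x is prime or 1, else 0 (x divides (x-1)! + 1)."""
    return 1 if (factorial(x - 1) + 1) % x == 0 else 0


def pi_count(m):
    """Number of primes <= m, as -1 + sum of F(x) for x = 1..m."""
    return -1 + sum(F(x) for x in range(1, m + 1))


def willans_second(n):
    """Sum of m * F(m) * floor(2^(-|pi(m) - n|)) for m = 1 .. 2^n."""
    total = 0
    for m in range(1, 2 ** n + 1):
        selector = 1 // 2 ** abs(pi_count(m) - n)   # 1 iff pi(m) == n
        total += m * F(m) * selector
    return total


# The individual terms of the worked example for p_4.
terms_p4 = [m * F(m) * (1 // 2 ** abs(pi_count(m) - 4)) for m in range(1, 17)]
```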


Third Formula 


Willans goes on to present another formula for the nth prime that does not use any "nonanalytic"[3] functions
such as floor and absolute value. He starts by noting that for x = 2, 3, ..., the function


$$\frac{((x-1)!)^2}{x} = \begin{cases} \text{an integer} + \dfrac{1}{x}, & \text{when } x \text{ is prime,} \\ \text{an integer,} & \text{when } x \text{ is composite or } 1. \end{cases}$$


[3] This is my terminology, not Willans's.


The first part follows from


$$\frac{((x-1)!)^2}{x} = \frac{((x-1)! + 1)((x-1)! - 1)}{x} + \frac{1}{x}$$


and x divides (x - 1)! + 1, by Wilson's theorem. Thus, the predicate "x is prime," for x ≥ 2, is given by


$$H(x) = \frac{\sin^2 \pi \dfrac{((x-1)!)^2}{x}}{\sin^2 \dfrac{\pi}{x}}.$$


From this it follows that 


$$\pi(m) = \sum_{x=2}^{m} H(x), \quad \text{for } m = 2, 3, \ldots.$$


This cannot be converted to a formula for $p_n$ by the methods used in the first two formulas, because they use the
floor function. Instead, Willans suggests the following formula[4] for the predicate x < y, for x, y ≥ 1:


[4] We have slightly simplified his formula.


$$\mathrm{LT}(x, y) = \sin\left( \frac{\pi}{2} \cdot 2^{e} \right), \quad \text{where } e = \prod_{i=0}^{y-1} (x - i).$$


Thus, if x < y, e = x(x - 1)...(0)(-1)...(x - (y - 1)) = 0 so that LT(x, y) = sin(π/2) = 1. If x ≥ y, the product does
not include 0, so e ≥ 1, so that LT(x, y) = sin((π/2) · (an even number)) = 0.


Finally, as in the first of Willans's formulas, 


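$$p_n = 2 + \sum_{m=2}^{2^n} \mathrm{LT}(\pi(m), n).$$


Written out in full, this is the rather formidable


$$p_n = 2 + \sum_{m=2}^{2^n} \sin\left( \frac{\pi}{2} \cdot 2^{\prod_{i=0}^{n-1} \left( \sum_{x=2}^{m} \frac{\sin^2 \pi \frac{((x-1)!)^2}{x}}{\sin^2 \frac{\pi}{x}} - i \right)} \right).$$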

Fourth Formula 


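Willans then gives a formula for $p_{n+1}$ in terms of $p_n$:


$$p_{n+1} = 1 + p_n + \sum_{i=1}^{p_n} \prod_{j=1}^{i} f(p_n + j),$$


where f(x) is the predicate "x is composite," for x ≥ 2; that is,


$$f(x) = \left\lfloor \cos^2 \pi \frac{((x-1)!)^2}{x} \right\rfloor.$$


Alternatively, one could use f(x) = 1 - H(x), to keep the formula free of floor functions.


As an example of this formula, let $p_n = 7$. Then,


$$\begin{aligned} p_{n+1} &= 1 + 7 + f(8) + f(8)f(9) + f(8)f(9)f(10) + f(8)f(9)f(10)f(11) \\ &\quad + \cdots + f(8)f(9) \cdots f(14) \\ &= 1 + 7 + 1 + 1 \cdot 1 + 1 \cdot 1 \cdot 1 + 1 \cdot 1 \cdot 1 \cdot 0 + \cdots + 1 \cdot 1 \cdot 1 \cdot 0 \cdot 1 \cdot 0 \cdot 1 \\ &= 11. \end{aligned}$$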
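

The fourth formula counts the run of composites that follows $p_n$: each product is 1 until the first prime is reached and 0 from then on. A Python sketch follows, with names of our own choosing. As an assumption made for exactness, the floor-of-cosine predicate is replaced by the equivalent test "x divides $((x-1)!)^2$".

```python
from math import factorial


def f(x):
    """Predicate "x is composite" for x >= 2.

    ((x-1)!)^2 / x is an integer exactly when x is composite, so
    floor(cos^2(pi * that quotient)) reduces to a divisibility test.
    """
    return 1 if (factorial(x - 1) ** 2) % x == 0 else 0


def next_prime(p):
    """p_{n+1} = 1 + p_n + sum_{i=1}^{p_n} prod_{j=1}^{i} f(p_n + j)."""
    total = 1 + p
    for i in range(1, p + 1):
        product = 1
        for j in range(1, i + 1):
            product *= f(p + j)
        total += product
    return total


chain = [2]
while chain[-1] < 29:
    chain.append(next_prime(chain[-1]))
```

The upper limit $p_n$ on the sum is enough because there is always a prime between $p_n$ and $2p_n$.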


16-3 Wormell's Formula 


C. P. Wormell [Wor] improves on Willans's formulas by avoiding both trigonometric functions and the floor 
function. Wormell's formula can in principle be evaluated by a simple computer program that uses only integer 


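arithmetic. The derivation does not use Wilson's theorem. Wormell starts with, for x ≥ 2,


$$B(x) = \prod_{a=2}^{x} \prod_{b=2}^{x} (x - ab)^2 = \begin{cases} \text{a positive integer,} & \text{if } x \text{ is prime,} \\ 0, & \text{if } x \text{ is composite.} \end{cases}$$


Thus, the number of primes ≤ m is given by


$$\pi(m) = \sum_{x=2}^{m} \frac{1 + (-1)^{2^{B(x)}}}{2}$$


because the summand is the predicate "x is prime."
Observe that, for n ≥ 1, a ≥ 0,


$$\prod_{r=1}^{n} (1 - r + a)^2 = \begin{cases} 0, & \text{when } a < n, \\ \text{a positive integer,} & \text{when } a \ge n. \end{cases}$$


Repeating a trick above, the predicate a < n is


$$(a < n) = \frac{1 - (-1)^{2^{\prod_{r=1}^{n} (1 - r + a)^2}}}{2}.$$


Because


$$p_n = 2 + \sum_{m=2}^{2^n} (\pi(m) < n),$$


we have, upon factoring constants out of summations,


$$p_n = \frac{3}{2} + 2^{n-1} - \frac{1}{2} \sum_{m=2}^{2^n} (-1)^{2^{\prod_{r=1}^{n} \left( 1 - r + \frac{m-1}{2} + \frac{1}{2} \sum_{x=2}^{m} (-1)^{2^{\prod_{a=2}^{x} \prod_{b=2}^{x} (x - ab)^2}} \right)^2}}.$$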


As promised, Wormell's formula does not use trigonometric functions. However, as he points out, if the powers
of -1 were expanded using $(-1)^n = \cos \pi n$, they would reappear.
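

Wormell's construction can be run with integer arithmetic only, as the following Python sketch suggests. The function names are ours. One shortcut is an assumption of the sketch: the exponent $2^e$ is never formed (it would be astronomically large); instead $(-1)^{2^e}$ is evaluated by parity, since $2^e$ is odd only for e = 0.

```python
def B(x):
    """Product of (x - a*b)^2 over 2 <= a, b <= x; zero iff x is composite."""
    product = 1
    for a in range(2, x + 1):
        for b in range(2, x + 1):
            product *= (x - a * b) ** 2
    return product


def sign(e):
    """(-1)^(2^e) for an integer e >= 0, by parity of 2^e."""
    return -1 if e == 0 else 1


def pi_w(m):
    """Number of primes <= m: sum of (1 + (-1)^(2^B(x))) / 2."""
    return sum((1 + sign(B(x))) // 2 for x in range(2, m + 1))


def square_product(a, n):
    """Product of (1 - r + a)^2 for r = 1..n; zero iff a < n."""
    product = 1
    for r in range(1, n + 1):
        product *= (1 - r + a) ** 2
    return product


def wormell_prime(n):
    """p_n = 2 + sum over m = 2..2^n of the predicate pi(m) < n."""
    return 2 + sum((1 - sign(square_product(pi_w(m), n))) // 2
                   for m in range(2, 2 ** n + 1))


def wormell_closed(n):
    """Same value after factoring constants out of the outer sum."""
    s = sum(sign(square_product(pi_w(m), n)) for m in range(2, 2 ** n + 1))
    return (3 + 2 ** n - s) // 2


first_five = [wormell_prime(n) for n in range(1, 6)]
```

`wormell_closed` corresponds to $3/2 + 2^{n-1} - \frac{1}{2}\sum(-1)^{\ldots}$, with the numerator kept as an integer so that the final division by 2 is exact.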


16-4 Formulas for Other Difficult Functions 


Let us have a closer look at what Willans and Wormell have done. We postulate the rules below as defining
what we mean by the class of functions that can be represented by "formulas," which we will call "formula
functions." Here, $\bar{x}$ is shorthand for $x_1, x_2, \ldots, x_n$ for any n ≥ 1. The domain of values is the integers ... -2, -1,
0, 1, 2, ....


1. The constants ... -1, 0, 1, ... are formula functions.


2. The projection functions $f(\bar{x}) = x_i$, for 1 ≤ i ≤ n, are formula functions.


3. The expressions x + y, x - y, and xy are formula functions, if x and y are.


4. The class of formula functions is closed under composition (substitution). That is,
$f(g_1(\bar{x}), g_2(\bar{x}), \ldots, g_m(\bar{x}))$ is a formula function if f and $g_i$ are, for i = 1, ..., m.


5. Bounded sums and products, written


$$\sideset{}{'}\sum_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}) \quad \text{and} \quad \sideset{}{'}\prod_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}),$$


are formula functions, if a, b, and f are, and $a(\bar{x}) \le b(\bar{x})$.


Sums and products are required to be bounded to preserve the computational character of formulas; that is,
formulas can be evaluated by plugging in values for the arguments and carrying out a finite number of
calculations. The reason for the prime on the Σ and Π is explained later in this chapter.


When forming new formula functions using composition, we supply parentheses when necessary according to 
well-established conventions. 


Notice that division is not included in the list above; that's too complicated to be uncritically accepted as a 
"formula function." Even so, the above list is not minimal. It might be fun to find a minimal starting point, but 
we won't dwell on that here. 


This definition of "formula function" is close to the definition of "elementary function" given in [Cut].
However, the domain of values used in [Cut] is the non-negative integers (as is usual in recursive function
theory). Also, [Cut] requires the bounds on the iterated sum and product to be 0 and x - 1 (where x is a
variable), and allows the range to be vacuous (in which case the sum is defined as 0 and the product is defined
as 1).


In what follows, we show that the class of formula functions is quite extensive, including most of the functions 
ordinarily encountered in mathematics. But it doesn't include every function that is easy to define and has an 
elementary character. 


Our development is slightly encumbered, compared to similar developments in recursive function theory, 
because here variables can take on negative values. However, the possibility of a value's being negative can 
often be accommodated by simply squaring some expression that would otherwise appear in the first power. 
Our insistence that iterated sums and products not be vacuous is another slight encumbrance. 


Here, a "predicate" is simply a 0/1-valued function, whereas in recursive function theory a predicate is a true/ 
false-valued function, and every predicate has an associated "characteristic function" that is 0/1-valued. We 
associate 1 with true and 0 with false, as is universally done in programming languages and in computers (in 
what their and and or instructions do); in logic and recursive function theory, the association is often the 
opposite. 


The following are formula functions: 


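1. $a^2 = aa$, $a^3 = aaa$, and so on.


2. The predicate a = b:


$$(a = b) = \sideset{}{'}\prod_{j=0}^{(a-b)^2} (1 - j).$$


3. $(a \ne b) = 1 - (a = b)$.


4. The predicate a ≥ b:


$$(a \ge b) = \sideset{}{'}\sum_{i=0}^{(a-b)^2} ((a - b) = i) = \sideset{}{'}\sum_{i=0}^{(a-b)^2} \; \sideset{}{'}\prod_{j=0}^{((a-b)-i)^2} (1 - j).$$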


We can now explain why we do not use the convention that a vacuous iterated sum/product has the value 0/1. If 
we did, we would have such shams as 


$$(a = b) = \sum_{i=0}^{-(a-b)^2} 1 \quad \text{and} \quad (a \ge b) = \prod_{i=a}^{b-1} 0.$$


The comparison predicates are key to everything that follows, and we don't wish to have them based on 
anything quite that artificial. 


5. $(a > b) = (a \ge b + 1)$.

6. $(a \le b) = (b \ge a)$.

7. $(a < b) = (b > a)$.

8. $|a| = (2(a \ge 0) - 1)a$.

9. $\max(a, b) = (a \ge b)(a - b) + b$.

10. $\min(a, b) = (a \ge b)(b - a) + a$.


Now we can fix the iterated sums and products so that they give the conventional and useful result when the 
range is vacuous. 


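11. $$\sum_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}) = (b(\bar{x}) \ge a(\bar{x})) \sideset{}{'}\sum_{i=a(\bar{x})}^{\max(a(\bar{x}),\, b(\bar{x}))} f(i, \bar{x}).$$


12. $$\prod_{i=a(\bar{x})}^{b(\bar{x})} f(i, \bar{x}) = 1 + (b(\bar{x}) \ge a(\bar{x})) \left( -1 + \sideset{}{'}\prod_{i=a(\bar{x})}^{\max(a(\bar{x}),\, b(\bar{x}))} f(i, \bar{x}) \right).$$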


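From now on we will use Σ and Π without the prime. All functions thus defined are total (defined for all values
of the arguments).


13. $n! = \prod_{i=1}^{n} i$.

This gives n! = 1 for n ≤ 0.

In what follows, P and Q denote predicates.

14. $\neg P(\bar{x}) = 1 - P(\bar{x})$.

15. $P(\bar{x}) \mathbin{\&} Q(\bar{x}) = P(\bar{x}) Q(\bar{x})$.

16. $P(\bar{x}) \mid Q(\bar{x}) = 1 - (1 - P(\bar{x}))(1 - Q(\bar{x}))$.

17. $P(\bar{x}) \oplus Q(\bar{x}) = (P(\bar{x}) - Q(\bar{x}))^2$.

18. if $P(\bar{x})$ then $f(\bar{y})$ else $g(\bar{z})$ $= P(\bar{x}) f(\bar{y}) + (1 - P(\bar{x})) g(\bar{z})$.


19. $a^n =$ if n ≥ 0 then $\prod_{i=1}^{n} a$ else 0.


This gives, arbitrarily and perhaps incorrectly for a few cases, the result 0 for n < 0, and the result 1 for
$0^0$.


20. $(m \le \forall x \le n) P(x, \bar{y}) = \prod_{x=m}^{n} P(x, \bar{y})$.


21. $(m \le \exists x \le n) P(x, \bar{y}) = 1 - \prod_{x=m}^{n} (1 - P(x, \bar{y}))$.


∀ is vacuously true, ∃ is vacuously false.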


22. $(m \le \min x \le n) P(x, \bar{y}) = m + \sum_{i=m}^{n} \prod_{j=m}^{i} (1 - P(j, \bar{y}))$.


The value of this expression is the least x in the range m to n such that the predicate is true, or m if the 
range is vacuous, or n + 1 if the predicate is false throughout the (nonvacuous) range. The operation is 
called "bounded minimalization" and it is a very powerful tool for developing new formula functions. 
It is a sort of functional inverse, as illustrated by the next formula. That minimalization can be done by 
a sum of products is due to Goodstein [Good]. 


23. $\lfloor \sqrt{n} \rfloor = (0 \le \min k \le |n|)((k + 1)^2 > n)$.


This is the "integer square root" function, which we define to be 0 for n < 0, just to make it a total
function.


24. $d \mid n = (-|n| \le \exists q \le |n|)(n = qd)$.
This is the "d divides n" predicate, according to which 0|0 but ¬(0|n) for n ≠ 0.


25. $n \div d =$ if n ≥ 0 then $(-n \le \min q \le n)(0 \le \exists r \le |d| - 1)(n = qd + r)$ else
$(n \le \min q \le -n)(-|d| + 1 \le \exists r \le 0)(n = qd + r)$.


This is the conventional truncating form of integer division. For d = 0 it gives a result of |n| + 1,
arbitrarily.


26. $\mathrm{rem}(n, d) = n - (n \div d)d$.


This is the conventional remainder function. If rem(n, d) is nonzero, it has the sign of the numerator n.
If d = 0, the remainder is n.


27. $\mathrm{isprime}(n) = n \ge 2 \mathbin{\&} \neg (2 \le \exists d \le |n| - 1)(d \mid n)$.


28. $\pi(n) = \sum_{i=1}^{n} \mathrm{isprime}(i)$.


(Number of primes ≤ n.)


29. $p_n = (1 \le \min k \le 2^n)(\pi(k) = n)$.


30. $\mathrm{exponent}(p, n) = (0 \le \min x \le |n|) \neg (p^{x+1} \mid n)$.
This is the exponent of a given prime factor p of n, for n ≥ 1.


31. For n ≥ 0.


32. The nth digit after the decimal point in the decimal expansion of
$\sqrt{2}$: $\mathrm{rem}(\lfloor \sqrt{2 \cdot 10^{2n}} \rfloor, 10)$.
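

A few of the definitions above can be exercised directly. The Python sketch below is illustrative only; the function names (`psum`, `pprod`, `eq`, `ge`, `fmax`, `fsum`, `fprod`, `bmin`) are ours. It builds everything from the primed (nonvacuous) sum and product, as in the list. Because the bounds $(a-b)^2$ grow quickly, these definitions are extremely slow, so the sketch is only exercised on tiny arguments.

```python
def psum(a, b, f):
    """Primed sum of rule 5: the range a..b must be nonvacuous."""
    assert a <= b
    return sum(f(i) for i in range(a, b + 1))


def pprod(a, b, f):
    """Primed product of rule 5: the range a..b must be nonvacuous."""
    assert a <= b
    result = 1
    for i in range(a, b + 1):
        result *= f(i)
    return result


def eq(a, b):
    """Item 2: product of (1 - j) for j = 0 .. (a-b)^2."""
    return pprod(0, (a - b) ** 2, lambda j: 1 - j)


def ge(a, b):
    """Item 4: sum over i = 0 .. (a-b)^2 of the predicate (a - b) = i."""
    return psum(0, (a - b) ** 2, lambda i: eq(a - b, i))


def fmax(a, b):
    """Item 9."""
    return ge(a, b) * (a - b) + b


def fsum(a, b, f):
    """Item 11: conventional sum, 0 when the range is vacuous."""
    return ge(b, a) * psum(a, fmax(a, b), f)


def fprod(a, b, f):
    """Item 12: conventional product, 1 when the range is vacuous."""
    return 1 + ge(b, a) * (-1 + pprod(a, fmax(a, b), f))


def bmin(m, n, P):
    """Item 22: bounded minimalization as a sum of products."""
    return m + fsum(m, n, lambda i: fprod(m, i, lambda j: 1 - P(j)))


def factorial_formula(n):
    """Item 13: n! as a conventional product, giving 1 for n <= 0."""
    return fprod(1, n, lambda i: i)


# Least x in 0..3 with x*x >= 5.
least = bmin(0, 3, lambda x: ge(x * x, 5))
```

Note that `fsum` and `fprod` always evaluate a nonvacuous primed range and then mask the result with the predicate b ≥ a, which is the point of items 11 and 12.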


Thus, the class of formula functions is quite large. It is limited, though, by the following theorem (at least): 


Theorem. If f is a formula function, then there is a constant k such that 


where there are k 2's. 


This can be proved by showing that each application of one of the rules 1-5 (on page 279) preserves the 
theorem. For example, if f(x ) = c (rule 1), then for some h, 
Similarly, it can be shown that the theorem holds for f(x, y) = xy.
The proofs that rules 4 and 5 preserve the theorem are a bit tedious but not difficult, and are omitted. 
From the theorem, it follows that the function 


Equation 4


$$f(x) = \left. 2^{2^{\cdot^{\cdot^{\cdot^{2}}}}} \right\} x \text{ 2's}$$


is not a formula function, because for sufficiently large x, Equation (4) exceeds the value of the same 
expression with any fixed number k of 2's. 


For those interested in recursive function theory, we point out that Equation (4) is primitive recursive. 
Furthermore, it is easy to show directly from the definition of primitive recursion that formula functions are 
primitive recursive. Therefore, the class of formula functions is a proper subset of the primitive recursive 
functions. The interested reader is referred to [Cut]. 


In summary, this section shows that not only is there a formula in elementary functions for the nth prime, but 
also for a good many other functions encountered in mathematics. Furthermore, our "formula functions" are not 
based on trigonometric functions, the floor function, absolute value, powers of -1, or even division. The only
sneaky maneuver is to use the fact that the product of a lot of numbers is 0 if any one of them is 0, which is
used in the formula for the predicate a = b.


It is true, however, that once you see them, they are not interesting. The quest for "interesting" formulas for 
primes should go on. For example, [Rib] cites the amazing theorem of W. H. Mills (1947) that there exists a θ
such that the expression


$$\lfloor \theta^{3^n} \rfloor$$


is prime-valued for all n ≥ 1. Actually, there are an infinite number of such values (e.g., 1.3063778838... and
1.4537508625483...). Furthermore, there is nothing special about the "3"; the theorem is true if the 3 is replaced
with any integer ≥ 3 (for different values of θ). Better yet, the 3 can be replaced with 2 if it is true that there is
always a prime between $n^2$ and $(n + 1)^2$, which is almost certainly true, but has never been proved. And
furthermore, ... well, the interested reader is referred to [Rib] and to [Dud] for more fascinating formulas of
this type.
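

For a feel of how Mills's expression behaves, here is a small Python sketch. The constant `THETA` is an assumption: it is the commonly quoted value of the smallest such θ, carried to double precision, and the name `mills` is ours. Double precision is only trustworthy for the first three values; larger n would need many more digits of θ, which in practice are obtained from the primes themselves.

```python
import math

# Assumed value: commonly quoted digits of Mills's constant (double precision).
THETA = 1.3063778838630806


def mills(n):
    """floor(theta^(3^n)), reliable in floating point only for n <= 3."""
    return math.floor(THETA ** (3 ** n))


first_three = [mills(n) for n in (1, 2, 3)]
```

Each value is the smallest prime above the cube of the previous one, which is why θ encodes the primes in much the same way as the magic number a of Section 16-1.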


Appendix A. Arithmetic Tables for a 4-Bit Machine 


In the tables in Appendix A, underlining denotes signed overflow. 


Table A-1. Addition


The table for subtraction (Table A-2) assumes that the carry bit for a - b is set as it would be for $a + \bar{b} + 1$, so
that carry is equivalent to "not borrow."


Table A-2. Subtraction (row - column)


For multiplication (Tables A-3 and A-4), overflow means that the result cannot be expressed as a 4-bit quantity.
For signed multiplication (Table A-3), this is equivalent to the first five bits of the 8-bit result not being all 1's
or all 0's.


Table A-3. Signed Multiplication
Tables A-5 and A-6 are for conventional truncating division. Table A-5 shows a result of 8 with overflow for
the case of the maximum negative number divided by -1, but on most machines the result in this case is
undefined, or the operation is suppressed.


Table A-5. Signed Short Division (row ÷ column)


Table A-6. Unsigned Short Division (row ÷ column)


Tables A-7 and A-8 give the remainder associated with conventional truncating division. Table A-7 shows a 
result of 0 for the case of the maximum negative number divided by -1, but on most machines the result for this 
case is undefined, or the operation is suppressed. 


Table A-7. Remainder for Signed Short Division (row ÷ column)


Appendix B. Newton's Method


To review Newton's method very briefly, we are given a differentiable function f of a real variable x and we
wish to solve the equation f(x) = 0 for x. Given a current estimate $x_n$ of a root of f, Newton's method gives us a
better estimate $x_{n+1}$, under suitable conditions, according to the formula


$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$


Here, $f'(x_n)$ is the derivative of f at $x = x_n$. The derivation of this formula can be read off the figure below (solve
for $x_{n+1}$):


[Figure not reproduced]


The method works very well for simple, well-behaved functions such as polynomials, provided the first 
estimate is quite close. Once an estimate is sufficiently close, the method converges quadratically. That is, if r 
is the exact value of the root, and $x_n$ is a sufficiently close estimate, then


$$|x_{n+1} - r| \le (x_n - r)^2.$$


Thus, the number of digits of accuracy doubles with each iteration (e.g., if
$|x_n - r| \le 0.001$, then $|x_{n+1} - r| \le 0.000001$).


If the first estimate is way off, then the iterations may converge very slowly, may diverge to infinity, may 
converge to a root other than the one closest to the first estimate, or may loop among certain values indefinitely. 




This discussion has been quite vague because of phrases like "suitable conditions," "well-behaved," and 
"sufficiently close." For a more precise discussion, consult almost any first-year calculus textbook. 


In spite of the caveats surrounding this method, it is occasionally useful in the domain of integers. To see
whether or not the method applies to a particular function, you have to work it out, such as is done in Section
11-1, "Integer Square Root," on page 203.


Table B-1 gives a few iterative formulas derived from Newton's method, for computing certain numbers. The
first column shows the number it is desired to compute. The second column shows a function that has that
number as a root. The third column shows the right-hand side of Newton's formula corresponding to that
function.


It is not always easy, incidentally, to find a good function to use. There are, of course, many functions that have
the desired quantity as a root, and only a few of them lead to a useful iterative formula. Usually, the function to
use is a sort of inverse of the desired computation. For example, to find $\sqrt{a}$ use $f(x) = x^2 - a$; to find $\log_2 a$ use
$f(x) = 2^x - a$, and so on.[1]


[1] Newton's method for the special case of the square root function was known to Babylonians about 4,000 years ago.


Table B-1. Newton's Method for Computing Certain Numbers


| Quantity to Be Computed | Function | Iterative Formula |


The iterative formula for $\log_2 a$ converges (to $\log_2 a$) even if the multiplier 1/ln 2 is altered somewhat (for
example, to 1, or to 2). However, it then converges more slowly. A value of 3/2 or 23/16 might be useful in
some applications (1/ln 2 ≈ 1.4427).
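

As an illustration, the two functions mentioned above, $f(x) = x^2 - a$ and $f(x) = 2^x - a$, lead to the iterations sketched below in Python. The iterative formulas in the comments are derived here from Newton's formula rather than copied from Table B-1, and the names `newton`, `newton_sqrt`, and `newton_log2`, the starting guesses, and the fixed iteration count are all assumptions of this sketch.

```python
import math


def newton(step, x0, iterations=60):
    """Apply the iteration x <- step(x) a fixed number of times."""
    x = x0
    for _ in range(iterations):
        x = step(x)
    return x


def newton_sqrt(a):
    """f(x) = x^2 - a gives x_{n+1} = (x_n + a / x_n) / 2, for a > 0."""
    return newton(lambda x: 0.5 * (x + a / x), x0=a)


def newton_log2(a, multiplier=1.0 / math.log(2.0)):
    """f(x) = 2^x - a gives x_{n+1} = x_n + (1 / ln 2)(a / 2^{x_n} - 1).

    `multiplier` stands in for 1 / ln 2 so that altered values
    (1, 3/2, 2, ...) can be tried, for a > 0.
    """
    return newton(lambda x: x + multiplier * (a / 2.0 ** x - 1.0), x0=1.0)


root2 = newton_sqrt(2.0)
log2_of_10 = newton_log2(10.0)
log2_of_10_altered = newton_log2(10.0, multiplier=1.5)
```

With an altered multiplier the fixed point is unchanged, since the correction term still vanishes exactly when $2^x = a$; only the rate of convergence differs.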